Protecting University Repositories from Aggressive Web Scraping: Using Database Rights to Retain Control Over Academic Content

© Eugen Stoica

Universities have long served as custodians of knowledge, investing in infrastructure - such as institutional repositories - to ensure research outputs are preserved, discoverable, and freely available. These repositories uphold transparency, reproducibility, and public accountability in scholarship. However, the rise of generative AI and large language models has created an insatiable demand for high-quality text data, turning academic repositories into prime targets for large-scale automated scraping by commercial actors. Usually carried out without permission, attribution, or transparency, this activity repurposes open content for proprietary model training and leaves repository managers facing a difficult dilemma: maintain openness or guard against exploitation.

At the University of Edinburgh, this dilemma is very real. Unauthorised scraping is more than a technical nuisance - though it often is one, to the point of slowing systems or blocking access for legitimate users. The greater concern is that it quietly undermines the principle of open access by enabling scholarly content to be repurposed for commercial AI training without consent, attribution, or oversight. For months, we have detected scraping activity targeting our DSpace repositories, while similar content hosted in the Elsevier-managed Pure system has not experienced the same level of traffic. This discrepancy points to a broader structural issue: commercial platforms benefit from technical safeguards, paywalls, and enforceable contracts, while open repositories often lack these defences. As a result, public institutions - those most committed to open science - are paradoxically the most exposed. A recent survey by the Confederation of Open Access Repositories (COAR) reported similar concerns.

This blog post proposes a strategic and legally grounded response. By asserting database rights - an often-overlooked form of intellectual property protection under UK and EU law - universities can reassert control over how their curated repositories are used. Importantly, this need not undermine openness. Rather, it allows universities to establish enforceable conditions that promote ethical reuse, transparency, and accountability.

 

Why Copyright Falls Short

Many assume that copyright can protect repositories from unauthorised reuse. In practice, it provides only limited recourse. Section 29A of the Copyright, Designs and Patents Act 1988 allows text and data mining (TDM) for non-commercial research, without a licence, as long as users have lawful access. Contract terms attempting to override this exception are unenforceable.

Yet most scraping for AI training does not qualify as non-commercial research, so the exception does not cover it. It is typically conducted by private firms operating with minimal transparency and no scholarly oversight. Even so, enforcement under copyright law remains difficult, because universities often do not hold the copyright in the works they host. Authors commonly transfer rights to publishers, meaning the institution can neither enforce those rights nor impose additional restrictions.

Jurisdiction further complicates matters. Scraping bots often operate from outside the UK or EU, making legal claims harder to enforce. And even where copyright might be infringed, universities lack standing unless they own the rights in question.

In short, copyright is not a dependable tool for repository protection. A different approach is needed - one based not on authorship of content but on stewardship of the repository as a structured collection.

 

Database Right: A Legal Tool with Untapped Potential

The UK and EU grant a sui generis database right to those who make substantial investments in obtaining, verifying, or presenting database contents. This protection guards against the extraction or re-utilisation of all or a substantial part of those contents, even if individual items within the database are not themselves protected by copyright.

Institutional repositories meet this threshold. Universities invest heavily in ingest processes, metadata creation, validation, curation, and long-term preservation. Staff expertise, IT infrastructure, and administrative oversight all contribute to maintaining these platforms. That investment qualifies repositories for database right protection.

The database right offers distinct advantages. It is independent from copyright and unaffected by the TDM exception. It allows institutions to control substantial reuse of the database, including mass extraction for AI training. And it enables enforcement based on the misuse of the dataset as a whole, rather than tracking individual documents.

 

Making the Database Right Work: A Layered Strategy

Asserting database rights requires no legal registration but must be done clearly and visibly. Repositories should include statements on their homepages and in documentation declaring database rights and specifying reuse conditions. These declarations reinforce existing legal protections and signal intent to enforce them.

Next, universities should introduce access licences - ideally click-through agreements or API terms of use. These licences might allow reading and non-commercial research, while reserving the right to authorise or deny uses such as AI training. By linking these conditions to database rights, institutions gain stronger grounds for enforcement.

To be effective, these licences should be machine-readable. Existing metadata standards can be adapted to communicate reuse terms to bots and aggregators. Doing so strengthens the institution's legal position and undermines any claim that infringement was unintentional.
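As a rough illustration of what a machine-readable statement of terms might look like, the sketch below builds a schema.org record (using the `Dataset` type and its `license` and `usageInfo` properties) and wraps it for embedding in repository pages. The repository name, URLs, the `#database-right` anchor, and the `x-aiTraining` field are all placeholders invented for this example; they are not an established convention, and a real deployment would follow whatever vocabulary the sector settles on.

```python
import json


def build_terms_record(name: str, url: str, terms_url: str) -> dict:
    """Describe a repository and its reuse terms as schema.org JSON-LD."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "url": url,
        # Page holding the access licence that users and bots are expected to honour.
        "license": terms_url,
        # Assumed convention: point usageInfo at the database right declaration.
        "usageInfo": terms_url + "#database-right",
        # Hypothetical extension field for illustration; not part of schema.org.
        "x-aiTraining": "permission-required",
    }


def to_script_tag(record: dict) -> str:
    """Wrap the record so it can be embedded in the repository's HTML pages."""
    body = json.dumps(record, indent=2, sort_keys=True)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


record = build_terms_record(
    name="Example Institutional Repository",        # placeholder name
    url="https://repository.example.ac.uk/",        # placeholder URL
    terms_url="https://repository.example.ac.uk/terms",
)
print(to_script_tag(record))
```

Publishing the same record on every landing page means a crawler that parses structured data encounters the terms wherever it enters the site, which is the point of making them machine-readable.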

Technical safeguards complement the legal strategy. CAPTCHA, rate limiting, and IP throttling can discourage automated scraping with little impact on human users. Structured API access can offer legitimate users a better alternative, while providing monitoring tools for the institution.
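To make the rate-limiting idea concrete, here is a minimal sketch of a per-client token bucket, one common way to cap request rates. It is not a description of how DSpace or any particular repository platform does this; the class names, the default limits, and the use of an IP address as the client key are assumptions chosen for the example.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class TokenBucket:
    capacity: float        # largest burst allowed, in requests
    refill_per_sec: float  # sustained request rate allowed
    tokens: float          # requests currently available
    updated: float         # time of the last refill

    def allow(self, now: float) -> bool:
        # Top the bucket up for the time elapsed, never beyond its capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class RateLimiter:
    """Keeps one bucket per client key (here, an IP address)."""

    def __init__(self, capacity: float = 5.0, refill_per_sec: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, client_ip: str) -> bool:
        now = self._clock()
        bucket = self._buckets.get(client_ip)
        if bucket is None:
            # New clients start with a full bucket.
            bucket = TokenBucket(self.capacity, self.refill_per_sec, self.capacity, now)
            self._buckets[client_ip] = bucket
        return bucket.allow(now)


# Demonstration with a fake clock so the behaviour is repeatable.
fake_now = [0.0]
limiter = RateLimiter(capacity=3, refill_per_sec=1.0, clock=lambda: fake_now[0])
print([limiter.allow("203.0.113.7") for _ in range(5)])  # [True, True, True, False, False]
fake_now[0] = 2.0                                         # two seconds later
print(limiter.allow("203.0.113.7"))                       # True
```

A human reader rarely exhausts a small burst allowance, whereas a bulk crawler hits the limit almost immediately; the refused requests can also be logged as evidence of attempted mass extraction.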

 

Navigating Commercial Licensing for AI Training

Enabling access to repository content for commercial AI training is complex - but not beyond reach. The key is ensuring that any such use is lawful, ethical, and based on valid permissions. Institutions should assess their legal position carefully and decide if this route is appropriate in their particular context.

Universities can only license works they control. In the UK, this generally means content for which they hold copyright - such as internal reports, policy documents, funded project deliverables, and some staff-authored outputs. It excludes most scholarly works, which are typically owned by academic publishers.

An important exception involves open licences. Creative Commons Attribution (CC BY) is especially permissive. On 15 May 2025, Creative Commons clarified that CC BY–licensed works may be used for AI training, provided copyright law permits it and attribution is maintained. Attribution could be a simple link to the source of the dataset used to train the model or a technical method, such as retrieval-augmented generation (RAG).

Other Creative Commons licences present challenges. Licences carrying the ND (NoDerivatives) element, such as CC BY-ND, generally prohibit AI training, since outputs from training are arguably derivative works. The NC (NonCommercial) element, as in CC BY-NC, restricts use to non-commercial contexts, excluding many AI companies. The SA (ShareAlike) element, as in CC BY-SA, imposes reciprocal licensing requirements, which are often incompatible with commercial use.

With some clever advertising, the pool of CC BY-licensed research outputs could support a prototype “AI-ready” dataset, especially in collaboration with other institutions. As open licensing continues to grow through Rights Retention policies, funder mandates and the requirements for the forthcoming REF, the potential for responsibly licensed data increases.
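The licence screening implied by this discussion could be sketched as follows. The function simply mirrors the reading of the Creative Commons licence elements given above; the label formats it accepts and the wording of the reasons are assumptions for illustration, and anything it does not recognise is set aside for manual review rather than guessed at.

```python
import re
from typing import NamedTuple


class Screening(NamedTuple):
    licence: str
    eligible: bool
    reason: str


# Licence elements treated above as obstacles to commercial AI training.
BLOCKING_ELEMENTS = {
    "ND": "NoDerivatives: training outputs are arguably derivative works",
    "NC": "NonCommercial: excludes commercial AI developers",
    "SA": "ShareAlike: reciprocal terms often incompatible with commercial use",
}


def screen(licence: str) -> Screening:
    """Classify a licence label such as 'CC BY-NC 4.0' or 'cc-by-4.0'."""
    # Split the label into alphabetic tokens, ignoring case, hyphens and versions.
    tokens = re.findall(r"[A-Za-z]+", licence.upper())
    if "CC" not in tokens or "BY" not in tokens:
        return Screening(licence, False, "not a CC BY-family licence: review manually")
    for element, reason in BLOCKING_ELEMENTS.items():
        if element in tokens:
            return Screening(licence, False, reason)
    return Screening(licence, True, "CC BY: reusable with attribution")


# Hypothetical repository records, for illustration only.
records = [
    {"id": "item-1", "licence": "CC BY 4.0"},
    {"id": "item-2", "licence": "CC BY-NC 4.0"},
    {"id": "item-3", "licence": "All rights reserved"},
]
ai_ready = [r["id"] for r in records if screen(r["licence"]).eligible]
print(ai_ready)  # ['item-1']
```

Keeping the reason alongside each decision makes the resulting dataset auditable, which matters if the selection is later challenged.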

 

Policy Change on the Horizon?

Depending on the outcome of debates in the Westminster Parliament around the Data Protection and Digital Information Bill and the forthcoming AI Bill, UK universities - under mounting financial pressure and increased scrutiny - may well reconsider their long-standing position of not claiming copyright over scholarly works authored by employees. Traditionally, this hands-off approach has supported academic freedom and allowed researchers to publish without institutional interference. But it has also enabled publishers to acquire broad rights, often on an exclusive basis, and now to monetise those same works through commercial AI licensing.

As generative AI adds yet another layer of value extraction, universities may begin to question the logic of this model. They fund the research, employ the researchers, and maintain the infrastructure - yet see little return from downstream commercial uses. A policy shift to assert institutional copyright - at least in limited contexts such as AI training - could give universities greater leverage and a share in the value they help create. While such a change would be complex and potentially controversial, it could help rebalance incentives across the scholarly communication system.

 

Strategic Alignment and Readiness

The strategy proposed here aligns with broader policy initiatives. The UK Government’s National Data Strategy and proposals for a National Research Data Cloud call for responsible data stewardship. The Department for Science, Innovation and Technology’s recent frameworks likewise stress ethical governance. Internationally, the UNESCO Recommendation on Open Science and EU initiatives like the Data Governance Act highlight the importance of legal clarity, interoperability, and fairness.

By asserting database rights and adopting enforceable licensing models, universities can play a proactive role in shaping the research data ecosystem. They can enable responsible AI development, protect academic values, and maintain public trust.

Crucially, this approach requires no new legislation. The tools already exist. What’s needed is institutional awareness, technical capacity, and coordination across the sector.

 

Conclusion

The ideals of open access - collaboration, transparency, public benefit - must be preserved. But they must also be protected. As AI systems become more sophisticated and data-hungry, universities must ensure that their generosity is not exploited.

By asserting database rights, deploying technical safeguards, and implementing clear licensing conditions, institutions can govern access ethically and effectively. They can invite reuse - but on fair terms. They can support innovation - but not at the cost of losing control over publicly funded knowledge.

This is not a rejection of AI. It is an invitation to engage with it on terms that respect scholarly integrity and institutional agency. With foresight and collective action, universities can shape a future where openness and control coexist - not in conflict, but in balance.

 
