
Response to Eugen Stoica (September 4, 2025) “Protecting University Repositories from Aggressive Web Scraping: Using Database Rights to Retain Control Over Academic Content”, Kluwer Copyright Blog

Andrew Johnson (The University of Sheffield)

September 11, 2025

In his post of September 4th, 2025, Eugen Stoica argued that universities should assert the sui generis database right to counter the increased web scraping and server traffic experienced by many university repositories. In this response I explain why I find this argument fundamentally flawed, both in its understanding of what university repositories are and in how the proposed solution would fail to meet its objective.

I should note at the outset that I do not dispute that increased bot and crawler traffic is a real problem. As the Confederation of Open Access Repositories (COAR) report cited in the original post shows, the issue is very real for many repository teams. Nonetheless, the many technical means of countering AI web crawlers, both existing and under development, seem a more logical and natural starting point for repository systems teams.

Stoica notes, “we have detected scraping activity targeting our DSpace repositories, while similar content hosted in the Elsevier-managed Pure system has not experienced the same level of traffic”. The public-facing full-text nature of DSpace, as against the internal-only nature of the Pure Current Research Information System (CRIS), makes this discrepancy wholly unsurprising. The comparison is between two quite different systems, and their respective discoverability and visibility would seem a more probable explanation than any contracts or safeguards. Rather than seeking ways to combat scraping that both make practical sense and respect the ideas and ideals behind open institutional repositories, the author appears to advocate a path that compromises them.

Turning first to the proposed application of the sui generis database right as a counter to the problem, it is by no means clear that this offers any practical solution. The right was established by Directive 96/9/EC (the Directive), and has been evaluated twice by the European Commission. In the UK, The Copyright and Rights in Databases Regulations 1997 (CRDR) enacted the Directive largely verbatim. There is by now substantial case law from the CJEU on the application and interpretation of the right (see for example British Horseracing Board Ltd v William Hill Organisation Ltd [2004] C-203/02, Football Dataco Ltd and Others v Sportradar GmbH and Sportradar AG [2012] C-173/11, CV-Online Latvia SIA v Melons SIA [2021] C-762/19, and also Ryanair Ltd v PR Aviation BV [2015] C-30/14 on the effect of a lack of qualification for the database right).

The second evaluation highlights the three main purposes of the Directive: “to harmonise protection of databases, stimulate investment in them and safeguard the balance between the rights and interests of database producers and users”. Recitals 40 to 48 of the Directive emphasise the goal of protecting the risk taken by database makers by safeguarding and ensuring remuneration for their substantial investment. If we assume, for the sake of argument, that the investment made in operating the repository qualifies for the database right, then it is here, given that the right aims to ensure “a means to secure the remuneration of the maker of the database” (recital 48 of the Directive), that the appeal to database right by university repositories meets its first serious obstacle.

Universities do not operate their repositories for remuneration, or with an expectation of financial return. Indeed, much of the growth in repository output volume has been driven by the UK Research Excellence Framework (REF) exercise since the first open access requirements were introduced in 2014. While a UK REF return can yield substantial quality-related research (QR) funding for institutions from Research England, operating a repository is not essential to securing all QR funding, given the extent of so-called “gold” open access publishing, whereby research articles are published under Creative Commons licences by the journal publishers, thus satisfying open access requirements for REF without the need for a repository version.

The judgment in the early case of British Horseracing Board Ltd v William Hill (at paragraphs 90-91) notes that insubstantial extraction and/or reutilisation that does not reconstitute and/or make available a substantial part of the source database does not pose any threat to the investment made by the database creator. It is difficult to see how a university would argue that crawling of an open repository by bots is harming its investment, or that such crawling should be treated as an unlawful extraction and reutilisation of a substantial part. The university actively wants people to access, copy and reuse the repository content, and does not seek to profit in any way from providing access to the literature. The author is objecting to an occasional side-effect of the level of copying, not to the fact that the copying is being done. Aggregation of university repository content by services such as CORE, or by open access search and indexing tools, is generally done with the agreement of repository owners despite leading to the extraction and reutilisation of a substantial part. The appeal here is solely to technical difficulty. While that may be in line with recital 42 if it represents a qualitative detriment to the investment, the investment in question is not one made with an expectation of remuneration or reward.

The very openness of the repository content is a further weakness in the argument. While the original post notes the problems in any copyright-based defence against scraping due to standing and territorial questions, it is also important to note how much repository content is open access under Creative Commons licences. In the UK, universities such as Edinburgh are now publishing a substantial proportion of their research articles under open copyright licences; an Edinburgh blog post notes that the figure for 2022 in the UK overall was 45.7%, and it is likely significantly higher today. In 2022 Edinburgh became the first UK university to adopt an institutional rights retention policy, under which accepted manuscripts of its authors’ articles are licensed CC BY in most cases, permitting any reuse of the protected work, including commercial reuse, provided attribution is given. Whether CC licences are implicated in machine learning remains to be tested in court and, as noted for example in a 2024 IViR paper by Szkalej and Senftleben, the AI training process may rely entirely upon exceptions and thus avoid any attribution requirement. Open institutional repositories exist in a non-profit capacity to allow end users free reuse of freely written outputs, in line with the original Budapest Open Access Initiative declaration. Implying that universities are having their investment harmed in a manner inconsistent with the protective purpose of the Directive/CRDR is at odds with the reality of repositories and much of their contents.

The open access purpose and ideals behind the early repository movement, along with the funder-mandated requirements for repositories, do not sit easily with a move to restrict access. In the UK many of the open access outputs in university repositories are funded by UKRI via its open access block grant. While direct open access publishing under a CC BY licence in a journal meets current UKRI open access policy requirements, according to the UKRI’s policy at Annex 2(5e), “route 2” policy compliance via university repository deposit is available only where the repository is registered in the Directory of Open Access Repositories (OpenDOAR), and such registration requires free access with no barriers or log-in.

This seems at odds with the proposed click-through licence terms and restricted access required to make Stoica’s database right proposal workable. He identifies the problem of copyright being too weak an option where TDM occurs outside the jurisdiction, while ignoring the fact that this obstacle applies equally to the database right. Terms of service (ToS) and clickwrap agreements will not help where a scraper ignores them and no enforceable contract is formed, or where the scraping is not carried out via a user account (an issue faced in the US by Meta in their suit against Bright Data). Yet restricting access behind a log-in violates the funder policy terms underwriting much of the UK’s open access publishing.

Some asides are worth making regarding points in the original article. The author notes that the s. 29A TDM exception in the UK CDPA does not apply to the database right, though as the scraping at issue is already identified as commercial, and therefore out of scope of that exception, this makes little difference. The absence of a TDM exception was not judged to be a problem in the government response to the Technical Review of Draft Legislation on Copyright Exceptions, which noted (page 13) the fair dealing exception for the purpose of non-commercial research in the CRDR and that “The Government’s view is that this existing exception will permit the extraction of whole works if required for text and data mining”.

“Coordination across the sector” on enforcing access limitations to repository content that is published open access, often due to funder mandate and often arising from cross-institutional author collaborations, will be very difficult to achieve. In a time of increasing budgetary pressure in the HEI sector, expecting amicable sector-wide cooperation between institutions to turn free repositories into for-profit licensing opportunities by striking content deals seems overly optimistic.

As noted, none of this dismisses the issues being faced by repositories, but we need to be clear that restructuring and repurposing university repository systems along the lines envisaged does not fit well with the open access movement and the ethos underlying it.