
A Blind Spot at the Heart of EU Copyright and AI Policymaking?

Paul Keller (Institute for Information Law (IViR))

May 27, 2026

Earlier this month the European Commission published a call for evidence relating to the "Report on the review of the Copyright in the Digital Single Market Directive" and a "Targeted initiative for a better copyright environment for European creativity and innovation". While not unexpected, this publication marks a significant moment in EU copyright policy. In it the Commission formally confirms what everybody had been expecting for a while now — that it is planning a new legislative intervention in the EU copyright framework, which it expects to include measures aimed at addressing "challenges faced by creators, notably market and technological developments linked to AI". The call for evidence makes it clear that this intervention is — for now — separate from the review of the 2019 Copyright in the Digital Single Market Directive, which is ongoing in parallel but which at this stage is not expected to result in a re-opening of the text of the Directive.

Instead, the measures aimed at dealing with AI would be part of a "Targeted initiative for a better copyright environment for European creativity and innovation". The term "targeted" should be taken with a grain of salt here, since the call for evidence lists four separate areas of intervention that cover entirely different parts of the copyright framework: (1) "challenges faced by creators [...] linked to artificial intelligence"; (2) online piracy of live events; (3) sound recordings of third-country nationals; and (4) measures facilitating the use of protected content in research (by clarifying or further harmonising the InfoSoc research exception and potentially introducing an EU-level secondary publication right).

Like the 2019 CDSM Directive before it, the initiative is shaping up to be a legislative Christmas tree. This post focuses on what is likely to be the most consequential of the four areas: further legislative measures to adapt the copyright framework to the disruption caused by generative AI.

The (return of the) value gap

Seven years after the adoption of the CDSM Directive, it is clear that the TDM exceptions have not delivered on their central promise: enabling data-driven technology development while giving creators and rightholders control over, and a meaningful basis for participating in, the value generated by such uses.

This promise largely depended on the opt-out mechanism introduced in Article 4(3) of the Directive which, despite ongoing efforts to make it work, does not function as a genuine licensing lever. The reasons for this are partly technical (there is still no widely adopted standard for expressing rights reservations) but also structural: the incentives around the mass-scale use of publicly available information and content are shaped by dynamics that extend far beyond the reach of the EU copyright framework. Most of the AI developers whose systems are trained on European content operate outside the EU, and the attempt to extend the reach of EU law through the AI Act's obligations for General Purpose AI model developers[1] has so far shown little practical impact on their behaviour. The TDM opt-outs thus offer creators and rightholders hypothetical protection at best, rather than the leverage they require for negotiating with AI developers.

The result is that the EU copyright framework has so far failed to prevent the large-scale, unidirectional extraction of value from the information and cultural sectors by AI developers. Unlike in the UGC platform discussions that led to the 2019 Directive, this time there is a real value gap, and a legislative intervention is justified.

But justified does not mean straightforward. The risk is that the EU legislator moves to legislate without a solid analytical understanding of the different kinds of uses that need to be regulated in an information ecosystem increasingly shaped by generative AI. Here the Call for Evidence gives some grounds for that concern.

The analytical gap

The Call for Evidence treats the use of works by AI as a single phenomenon. To anyone who has been following the legal debate around generative AI, or the technical realities of how these systems are built and deployed, it should be obvious that this is not the case. The generic framing ‘use of works by AI’ papers over distinctions that are legally and economically fundamental. Training a model on copyrighted content and using copyrighted content at inference time are distinct acts, and conflating them at this stage will produce a legislative instrument that is incoherent.

The two uses are legally distinct acts...

Training generative AI models generally involves reproduction of large amounts of works and other content to create a statistical model. While there may be edge cases that are actively being litigated (see GEMA v OpenAI), the resulting models do not contain the vast majority of the works that they have been trained on in any recoverable sense. They contain numerical values (weights) derived from them. Here the relevant copyright question is whether the reproductions made during training require authorisation, and the EU copyright framework provides a clear basis for making this determination in the form of Articles 3 and 4 of the CDSM Directive.

On the other hand, uses at inference time — the use of works as inputs into trained models and deployed systems, whether through direct human input, retrieval-augmented generation, real-time search grounding, or similar mechanisms — operate differently. Here the question is whether serving a user response that draws on a specific copyrighted work involves, in addition to technical reproductions, a communication to the public. It also raises questions about which exceptions and limitations apply to acts performed through these systems at the direction of users. This is legally much less settled territory, but there is a technical difference from training that is analytically significant: at inference time it is often possible to trace a direct relationship between specific works used as inputs and the output generated. This makes these uses both more attributable and more controllable than uses at training time.

... and have different economic logics

At training time, the value of any individual work is marginal and largely unquantifiable. A model trained on billions of works and other types of content cannot meaningfully attribute value to any one of them. Research on so-called influence functions, which sought to provide a technical solution to the training-time attribution problem, has made incremental progress but continues to face fundamental limitations that prevent it from scaling to the size and complexity of modern training datasets (see Grosse et al., Studying Large Language Model Generalization with Influence Functions, 2023, and Hammoudeh & Lowd, Training Data Influence Analysis and Estimation: A Survey, 2024). Thus the training data attribution problem remains unsolved in practice.

This is compounded by the fact that training datasets typically contain large amounts of non-copyrighted or unmanaged content alongside protected works, which makes it structurally difficult for any individual rightholder to establish a meaningful claim on the economic contribution of their works to a given model. Attribution is further complicated in settings involving distillation, where a model is trained on the outputs of another model rather than directly on original works. Finally, unlike inference, training is not a continuous transactional relationship with specific works — it is a one-time event for any individual model, and a relatively infrequent one even at the level of individual companies.

All of this makes individual or bilateral licensing at training time unworkable in practice: any attempt to negotiate a licence runs immediately into an unsolvable valuation problem.

At inference time, the relationship between specific works and the value generated is potentially more traceable, particularly in retrieval-augmented generation (RAG) systems where the source document is actively retrieved and used. The picture is more complex for transformation-heavy uses such as summarisation or translation, where existing exceptions and limitations may apply to some or all of the acts involved, depending on the specific use.

Unlike training, inference is a recurring process: every response generated by a deployed system is a discrete transaction, which makes value attribution technically feasible in a way that retrospective attribution for uses at training time is not. Information providers are also in a much stronger position when it comes to controlling access to their works at inference time — paywalls, access controls, opt-outs and other preference signals can function as meaningful leverage here in a way they do not during training, where the gap between content being made available and its incorporation into a model makes timely intervention structurally difficult.

Taken together, these factors mean that retrieval-based inference-time uses are structurally more amenable to licensing arrangements, which is why the majority of known licensing and partnership arrangements involve these uses.
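To make the point about per-response traceability concrete, the following is a minimal sketch of how a retrieval-based system could record which works were used for each response. It is an illustration under stated assumptions, not a description of how any deployed system works: the `Document` type, the `retrieve` and `answer_with_attribution` functions, the word-overlap ranking, and the publisher names are all hypothetical, and the resulting counts are a usage record, not a remuneration model.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """Hypothetical record for one retrievable work."""
    doc_id: str
    rightholder: str
    text: str


def tokenize(text: str) -> set[str]:
    """Lower-case the words and strip trailing punctuation."""
    return {word.strip(".,;:!?").lower() for word in text.split()} - {""}


def retrieve(query: str, corpus: list[Document], top_k: int = 2) -> list[Document]:
    """Toy retriever: rank documents by the number of words shared with the query."""
    query_terms = tokenize(query)
    scored = [(len(query_terms & tokenize(doc.text)), doc) for doc in corpus]
    # Keep only documents that share at least one word, best match first.
    matches = [(score, doc) for score, doc in scored if score > 0]
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in matches[:top_k]]


def answer_with_attribution(query: str, corpus: list[Document], usage_log: Counter) -> list[str]:
    """Record which rightholders' works were retrieved for this response.

    A real system would pass the retrieved text to a model at this point;
    the sketch only keeps the attribution step.
    """
    sources = retrieve(query, corpus)
    for doc in sources:
        usage_log[doc.rightholder] += 1
    return [doc.doc_id for doc in sources]


# Hypothetical corpus with made-up publishers.
CORPUS = [
    Document("news-001", "Publisher A",
             "The Commission published a call for evidence on copyright and AI."),
    Document("blog-002", "Publisher B",
             "Opt-out signals let rightholders reserve text and data mining rights."),
    Document("wiki-003", "Publisher C",
             "Sourdough bread needs flour, water and salt."),
]

usage_log = Counter()
for question in [
    "What did the Commission publish about copyright?",
    "How do opt-out signals reserve mining rights?",
    "Which call for evidence covers copyright?",
]:
    print(question, "->", answer_with_attribution(question, CORPUS, usage_log))
print(dict(usage_log))
```

Each call is one transaction that yields a list of source identifiers, and the running counter aggregates them per rightholder. No comparable per-use record exists at training time, which is the asymmetry the argument above relies on.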

The Call for Evidence conflates these two uses

An intervention aimed at "Improving rightholders' access to information on the use of their content" means something very different depending on whether it refers to training-time use or inference-time use. For training, the transparency obligation under the AI Act already applies, although it is clearly not sufficient. For inference, the transparency question is about real-time attribution, which is a different technical problem, but one that is likely solvable given strong market incentives for a solution. Both information providers and AI developers have an interest in robust attribution at inference time: the former because it is a precondition for any licensing or remuneration model, the latter because verifiable sourcing is increasingly a valuable property of the products they are developing.

Similarly, “facilitating licensing through mediation or arbitration” is only a coherent option for inference-time uses where the value of individual works to specific transactions can be assessed. Applied to training, it would either produce a de facto blanket licence scheme or be completely unworkable at scale. Any intervention aimed at ensuring remuneration for training-time uses will need to look fundamentally different from a licensing mechanism, and closer to a collective or levy-based model that does not depend on individual attribution.

Whether the failure to make these distinctions at the call for evidence stage reflects the constraints of the format or a genuine analytical gap in the Commission's current thinking, the practical consequence is the same. The categories established at this stage will shape the structure of the impact assessment, which will in turn shape the structure of the legislative proposal. Once the two uses are bundled into a single problem definition, that conflation becomes embedded in the legislative process and extremely difficult to disaggregate later.

In light of this, the Commission should treat the training/inference distinction as an analytical prior: it needs to be established before policy options are considered (see also Primavera de Filippi, Redevance IA: risques et opportunités, La Tribune, 9 April 2026, who makes a similar analytical distinction and draws out its implications for levy design). This also means that stakeholders responding to the Call for Evidence should press for this distinction explicitly. A legislative instrument that fails to distinguish between these two fundamentally different uses will either collapse under the weight of its own incoherence or produce outcomes that serve neither creators nor the broader information ecosystem.

Image: “European Commission Flags” by LIBER Europe, CC BY 2.0

[1] Articles 53(1)(c) and (d) of the AI Act require providers of General Purpose AI models to put in place a policy to comply with EU copyright law and to publish a summary of the content used for training, regardless of where the provider is established.