Copyright in Formaldehyde: How GEMA v OpenAI Freezes Doctrine and Chills AI – Part 2
December 11, 2025
Part 1 of this post outlined the decision in GEMA v OpenAI, placed it within the technical workflow of large language models (LLMs), and explained why treating training as “reproduction”, in the way GEMA suggests, is technically and doctrinally misguided. This Part 2 highlights the broader policy costs of that move – for innovation, for Europe’s position in AI, and for copyright’s own idea–expression architecture.
5 Evidence games and the neutral-tool problem
A further problem in GEMA is the way liability is allocated.
ChatGPT is a general-purpose conversational system. It can answer homework questions, draft emails, summarise news, write code, make up poems—and, under certain prompts, regurgitate lyrics. The screenshots before the court are the product of highly specific prompting. They show that, if you ask hard enough, you can get the model to spit out long passages of lyrics.
The legal question is how we characterise this behaviour. EU intermediary-liability law and US doctrine have, for two decades, built a fairly robust framework around this problem:
On the EU side, the E-Commerce Directive and now the DSA protect hosting and caching providers against strict liability, as long as they act expeditiously on specific notices and do not engage in general monitoring. The CJEU’s YouTube/Cyando line confirms that providers are not automatically communicating each user upload to the public; duties turn on knowledge, control and targeted injunctions.
On the US side, the “volitional conduct” requirement for direct infringement – developed in cases like Cartoon Network v CSC and Perfect 10 v Giganews – asks who actually presses the button that causes the reproduction. Neutral tools are not directly liable for what users choose to do with them; secondary liability and notice-and-takedown do the heavy lifting.
Seen through that lens, most regurgitation evidence in GEMA and in the New York Times case is evidence that users, under very specific prompts, can induce unlawful outputs from a neutral tool that was never designed to serve as a lyrics database. The appropriate response is therefore twofold:
impose negligence-based duties on providers to deploy reasonable guardrails (deduplication, filters, unlearning, notice-and-action);
keep primary attribution with the user who chooses to elicit and exploit the regurgitated material.
GEMA does almost the opposite. By treating memorisation as proof that the songs are “inside” the model, it re-centres liability on the mere act of training and model deployment, rather than on the combination of flawed guardrails and opportunistic prompting.
6 Market harm, trivial lyrics and the wrong side of the trade-off
There is also something faintly surreal about the underlying market story in GEMA.
The works at issue are song lyrics of the most conventional sort – Christmas songs, birthday songs, pop ballads – many of which are already widely and freely accessible online, on commercial or quasi-official lyric sites and fan pages. This is not a case about leaked news scoops, scientific articles locked behind paywalls, or high-value educational content. It is hard to see a meaningful incremental harm to lyric markets in the fact that, under certain prompts, ChatGPT can reproduce verses that any motivated user can find with a basic search.
By contrast, the social upside of training is very large. General-purpose “cultural machines” that have read a wide swathe of human textual production are increasingly part of how we write, read, translate, search and create. Blocking or pricing training as if it were the core commercial exploitation of each individual song or article would raise entry barriers dramatically—and it would do so in a region that is already an AI laggard.
The numbers are not pretty. According to the European Commission, in 2023 AI venture capital investment in the EU was about 8 billion USD, compared to 68 billion in the US and 15 billion in China. Stanford’s 2025 AI Index counts around forty “frontier” models originating in the US, fifteen in China and just three in Europe. McKinsey, the OECD and the European Court of Auditors all tell versions of the same story: the US and China dominate frontier-model development and private AI investment; the EU has world-class regulation – the AI Act – and respectable public research, but trails badly in venture capital, scaling and deployment. Europe has built perhaps the most sophisticated ex ante regulatory framework for AI in the world, while contributing only a small fraction of the frontier models and commercial AI platforms that now structure global markets. From a policy perspective, this is a dangerous combination: in what is plausibly one of the major technological revolutions in human history, a region that does not develop AI risks becoming a rule-taker rather than a rule-maker.
In that light, expansive copyright paradigms of the kind GEMA represents are more than just doctrinal curiosities. They are close to a perfect case study in how Europe can hobble itself. Faced with a choice between protecting a marginal market – licensed access to song lyrics that already circulate freely online – and enabling the emergence of competitive European AI models, GEMA effectively privileges the former. It picks the market for lyrics over the market for AI.
Against that background, GEMA is a textbook example of how over-protective copyright doctrine can hard-wire structural disadvantages into an emerging general-purpose technology. It turns that marginal market into a veto point over AI training, at a moment when Europe is finally trying to grow its own models.
It is difficult to avoid the impression that the court is particularly receptive to a narrative in which German authors and publishers are pitted against a large US tech firm. That is not a uniquely German bias; most jurisdictions have had moments of doctrinal exuberance when local industries square up to foreign platforms. But it is precisely why we have higher courts and, in the EU, the CJEU: to discipline local impulses with a more systemic view of internal-market integration, innovation and fundamental rights.
7 The teleology problem: literalism, dichotomies and functionality
Underneath all this sits a deeper doctrinal issue. The kind of reasoning endorsed in GEMA—“there is a digital copy somewhere in the pipeline, therefore there is reproduction, therefore we need a licence or an exception”— is not just a mistake about technology. It is the product of a particular literalist interpretative stance in European copyright law: a strongly textual, provision-driven reading of Article 2 InfoSoc and its national implementations, in which the mere presence of a digital copy becomes almost self-evidently decisive.
In that tradition, the judge is still tempted to act as la bouche de la loi—the “mouth of the law”—reciting the literal breadth of the reproduction right and stopping there. The idea–expression dichotomy is acknowledged in theory but rarely allowed to do real work. Functional considerations, market effects and technological design features are pushed to the margins. Teleological or progressive interpretation—reading norms in light of their purpose, the structure of the system and the externalities they create—takes a back seat.
That approach was already under strain with the rise of caching, streaming and search. It required the EU legislature to bolt on Article 5(1) InfoSoc (transient copies) as a kind of safety valve to prevent the entire internet from becoming one long infringement. With AI training, the cracks widen further:
First, if every technical copy in a machine-learning pipeline is treated as a full-blown exercise of the reproduction right, then copyright effectively extends to facts, statistics and functional learning processes. That is hard to reconcile with the internal architecture of copyright, and particularly with the idea–expression dichotomy that has long been used—much more consistently in US law—to keep methods, systems and data structures outside the monopoly.
Second, such an expansive reading is difficult to justify on law-and-economics grounds. It multiplies transaction costs and legal uncertainty precisely in those areas—intermediate technical uses, network infrastructure, general-purpose tools—where ex ante licensing is least feasible and ex post, use-specific remedies are most efficient. A teleological or progressive approach would instead ask what interpretation maximises positive externalities (innovation, access, follow-on creativity) while containing negative ones (substitutive copying, free-riding).
Third, it pushes courts into the role of protecting moribund business models (lyric licensing, photo libraries, etc.) at the expense of enabling new forms of knowledge access and creativity. That is the opposite of what copyright’s constitutional narratives—from the French droit d’auteur rhetoric of personality and culture (the authentic Le Chapelier “gift to the public” rhetoric, not its later copyright-maximalist travesty) to the US Progress Clause—say they are supposed to do.
The brute fact is that a neutral, general-purpose cultural machine that learns from a broad corpus of human expression and then generates an unbounded variety of new outputs sits much more naturally on the “idea/knowledge” side of the dichotomy than on the “fixed work” side. Copyright should police those outputs when they substantially reproduce protected expression, not the underlying act of learning.
Intermediate digital copying has always been a liminal case. The better view, in my opinion, is that:
Lawful intermediate technical copies that are integral to a non-substitutive analytic process – crawling, indexing, TDM, training – belong either outside the scope of the reproduction monopoly altogether, or within a carefully circumscribed corridor of exceptions (Article 5(1) InfoSoc, Articles 3–4 CDSM).
Attempts to re-property-ise those copies are less about doctrinal purity than about rent-seeking by rightsholder industries that have grown comfortable with expanding notions of reproduction and communication. They risk turning copyright into a general control instrument over the functionality of digital technologies.
GEMA is a case study in how this dynamic plays out when courts are not attentive to the technological and economic stakes.
8 A better way forward: target extraction, not learning
None of this is to deny that memorisation and regurgitation raise real problems. If a model systematically spits out full lyrics or whole news articles under everyday prompts, we have a classic case of substitution and should respond accordingly.
The point is that we have tools to do so without collapsing training into infringement.
On the provider side:
Deduplicate and de-bias training datasets to reduce over-representation and memorisation risk.
Implement prompt and output filters that make it hard to elicit verbatim copyrighted material (a minimal sketch of such a filter follows this list).
Use unlearning and model-editing techniques to remove specific memorised content once it is identified.
Maintain logs and transparency reports – and, in the EU, comply with the AI Act’s obligations on copyright policies and training-data summaries for general-purpose models.
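To make the filtering point concrete, here is a minimal sketch in Python of the kind of verbatim-overlap check a provider might run on model outputs before returning them. Everything in it is an assumption for illustration: the n-gram length, the threshold and the `protected_texts` reference set are hypothetical, and a production system would use a scalable fingerprint index rather than an in-memory set.

```python
# Minimal sketch of a verbatim-overlap output filter. Assumes the provider
# holds a reference set of protected texts ("protected_texts" is a
# hypothetical stand-in); real deployments would use a scalable index
# (e.g. hashed n-grams or suffix structures), not an in-memory set.

from typing import Iterable

NGRAM_LEN = 8  # word-window length; tune to trade recall against false positives


def word_ngrams(text: str, n: int = NGRAM_LEN) -> set[tuple[str, ...]]:
    """Lower-cased word n-grams, a cheap fingerprint of verbatim text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_index(protected_texts: Iterable[str]) -> set[tuple[str, ...]]:
    """Union of n-gram fingerprints over all protected works."""
    index: set[tuple[str, ...]] = set()
    for text in protected_texts:
        index |= word_ngrams(text)
    return index


def verbatim_overlap(candidate: str, index: set[tuple[str, ...]]) -> float:
    """Fraction of the candidate's n-grams that also appear in protected works."""
    grams = word_ngrams(candidate)
    if not grams:
        return 0.0
    return len(grams & index) / len(grams)


def filter_output(candidate: str, index: set[tuple[str, ...]],
                  threshold: float = 0.5) -> str:
    """Suppress (or route to human review) outputs dominated by memorised text."""
    if verbatim_overlap(candidate, index) >= threshold:
        return "[output withheld: substantial verbatim overlap with a protected work]"
    return candidate
```

The same fingerprinting logic serves the deduplication point above: near-duplicate documents in the training corpus are a major driver of memorisation, so dropping them before training attacks the problem at its source rather than at the output stage.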
On the remedies side:
Favour proportionate, targeted injunctions that require concrete guardrails or adjustments to training corpora and RAG indexes, rather than blanket bans on training.
Use notice-and-action mechanisms to address specific regurgitation behaviours and outputs.
This is also where debates on remuneration belong. If we over-enclose training, we risk entrenching a few incumbents who can afford massive licensing deals, while locking out open, public-interest and smaller commercial models. If, instead, we use fair use and TDM in a principled way, combined with targeted remedies and remuneration for genuinely substitutive outputs, we can keep entry barriers reasonable without abandoning authors.
If we want a “creative dividend” from AI, it should not take the form of a per-file “training tax” that turns every transient copy into a tollbooth. Much better to design collective, output-level mechanisms – levies or statutory/extended collective licences on high-turnover generative services whose outputs demonstrably compete with human works, funded from revenue and distributed via CMOs on the basis of usage data – while leaving TDM exceptions intact for research, non-substitutive uses and open models.
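To see how such an output-level mechanism differs from a per-file training tax, consider a deliberately simplified sketch. All figures and names below are invented, and the distribution keys real CMOs use are far more complex; the point is the structure – a pool funded from revenue, split pro-rata by measured substitutive usage, with zero charged for mere ingestion.

```python
# Hypothetical illustration of an output-level levy, not a proposal with
# real figures: the pool is a percentage of a service's revenue, and each
# rightsholder's share is pro-rata to measured substitutive usage.

def distribute_levy(revenue: float, levy_rate: float,
                    usage_counts: dict[str, int]) -> dict[str, float]:
    """Split a revenue-based levy pool across rightsholders by usage share."""
    pool = revenue * levy_rate
    total = sum(usage_counts.values())
    if total == 0:
        return {holder: 0.0 for holder in usage_counts}
    return {holder: pool * count / total
            for holder, count in usage_counts.items()}


# Example: a service with EUR 100m revenue, a 2% levy, and CMO usage data
# recording how often outputs substituted for each catalogue (all invented).
payouts = distribute_levy(
    revenue=100_000_000, levy_rate=0.02,
    usage_counts={"catalogue_a": 600, "catalogue_b": 300, "catalogue_c": 100},
)
print(payouts)  # {'catalogue_a': 1200000.0, 'catalogue_b': 600000.0, 'catalogue_c': 200000.0}
```

Note what does not appear in the computation: the number of works ingested during training. That is the design choice the paragraph above argues for.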
We pay where there is real substitution, not where there is mere ingestion.
9 Conclusion: don’t let GEMA set the frame
GEMA v OpenAI is being sold as Europe’s first big AI-training case. In reality, it is a narrow, fact-specific memorisation dispute dressed up as a referendum on training—and it gets the technology, the doctrine and the policy calculus wrong. The better path is to:
keep training within the corridor defined by transient-copy and TDM rules;
keep output liability focused on users and specific substitutive uses, with negligence-based duties for providers; and
keep reproduction and remuneration lines tied to recognisable expression and real market substitution, not every electrical echo of a work in the bowels of a training pipeline.
If we manage that, we will have copyright doctrine that is faithful to the technology, honest about the markets, and still capable of doing what it was meant to do: sustain a living, evolving culture, rather than embalming yesterday’s business models in legal formaldehyde.