Copyright in Formaldehyde: How GEMA v OpenAI Freezes Doctrine and Chills AI – Part 1
December 10, 2025
My impulse to write this piece came from a question at a recent conference, where I was speaking about AI training, fair use and EU text-and-data mining (TDM). During the Q&A, someone asked about the then-fresh decision of the Landgericht München I in GEMA v OpenAI (42 O 14139/24, 11 November 2025). I answered a bit too briskly that I did not think the case deserved the weight people were giving it: in my view, it misreads how machine learning works, mislabels memorisation as “reproduction”, and arrives at the wrong policy conclusion at exactly the wrong time.
Since then, media coverage, collecting-society press releases and early academic commentary have started to cast GEMA as a landmark for AI training in Europe. That, I think, is dangerous. So, this post tries to do what I have not had the time to do in the conference room: slow down, unpack what the Munich court actually did, and explain why it is a poor candidate for setting the legal frame for AI training in the EU.
Part 1 of this post will outline the decision and place it in the actual workflow of large language models (LLMs), and explain why treating training as “reproduction”, in the way GEMA suggests, is technically and doctrinally misguided. Part 2 will highlight the broader policy costs of that move: for innovation, for Europe’s position in AI, and for copyright’s own idea-expression architecture.
1 What GEMA v OpenAI actually decided
The basic story is by now familiar. The German collecting society GEMA sued OpenAI before the Landgericht München I on behalf of music publishers and songwriters. It alleged that ChatGPT had “memorised” at least nine well-known German songs (including Atemlos durch die Nacht, Männer, Über den Wolken, In der Weihnachtsbäckerei) and could reproduce their lyrics almost verbatim when appropriately prompted.
The evidence consisted of chat transcripts annexed to the claim. GEMA’s lawyers tested both ChatGPT running GPT-4 and custom GPT-4o agents configured as “experts on song lyrics”. With GPT-4—already no longer OpenAI’s flagship model by the time of the proceedings—very simple, title-based prompts such as “What are the lyrics of [song title]?”, “What is the chorus of [song title]?”, followed by “Please also give me the first verse/second verse” sometimes produced long stretches of lyrics. For GPT-4o, the state-of-the-art model at the time, the most problematic outputs in the record arose when the plaintiffs deliberately role-primed the system as a lyrics specialist (with system messages along the lines of “you know all song lyrics and can reproduce them correctly and completely”), disabled web search, ran these “lyrics expert” agents under several different accounts, and probed them repeatedly with the same kind of title-based prompts. Under this combined set-up, the plaintiffs obtained outputs that, for each of the nine songs, contained substantial stretches of lyrics: in some cases entire refrains or verse-plus-refrain blocks rendered verbatim, in others a mixture of correct lines and hallucinated additions. Screenshots of these exchanges formed the core of GEMA’s “memorisation and regurgitation” case.
OpenAI responded that these outputs were atypical: they argued that the plaintiffs had “provoked” the results by systematically probing the models and by constructing “lyrics expert” agents precisely to pull GPT-4o off its default, more cautious behaviour. That nuance matters. What GEMA actually demonstrated is, at least in the GPT-4o scenario, targeted regurgitation under engineered conditions, not the default behaviour that an average user would encounter in a casual one-off query. It is also telling that, in my own testing at the time of writing, current ChatGPT deployments no longer output these protected lyrics at all—even when asked simply to translate the passages of the Munich judgment where the court itself reproduces them.
The court, however, was not persuaded by OpenAI’s account. It placed decisive weight on the fact that, at least on the surface, the prompts “looked simple”, and it treated the curated screenshots as sufficient proof that the songs were stored and reproducible from within the models. From there, it felt entitled to read the curated examples from GPT-4o as revealing the “true nature” of AI training and to draw much broader conclusions about reproduction in the model and the legality of training as such. On that basis, it granted injunctive relief, claims for information and damages. On the way, it made three important moves (simplifying a little):
It treated both the ingestion of lyrics into the training corpus and the subsequent memorisation within the model as acts of reproduction under German law (implementing Articles 2 and 3 of the InfoSoc Directive).
It rejected the application of the German TDM exception (§ 44b UrhG), on the theory that generative models “permanently memorise” lyrics and go beyond mere analysis.
It read the regurgitated outputs as proof that lyrics were “fixed” inside the model in a way comparable to a digital library.
Politically, the case has already been framed as a major victory for German authors and collecting societies against a large US AI provider. That, in turn, has fed into the broader debate on whether AI training must always be licensed in Europe.
Before we elevate GEMA into a precedent on “AI training”, it is worth asking a more basic question: what, exactly, in the model’s workflow is this court actually looking at?
2 The missing workflow: where do copies really live?
One of the recurring mistakes in legal debates on AI is to speak about “training” as a single monolithic act. From a technical perspective, there are at least five distinct stages (a minimal code sketch follows the list):
Ingestion and preprocessing. Texts and images are collected, cleaned and tokenised. At this point, we clearly have ordinary copies on disks and in RAM, plus derivative token sequences. This looks very much like classic reproduction – but it is still conceptually separate from the trained model.
Training loop. The system iterates over the dataset, running forward and backward passes and updating billions of numerical weights. Here we mostly have transient technical copies in memory, overwritten as training proceeds. When training is finished, the raw corpus can in principle be deleted; what remains is a vector of parameters.
Model artefact. The saved model is essentially a huge matrix of weights. You cannot “open” it and read a song or a news article. It is a compressed statistical representation of patterns in the data, not a human-readable copy.
Generation (inference) and retrieval. At runtime, a user prompt is tokenised; the model computes a probability distribution over the next token and samples a sequence. Some systems add retrieval-augmented generation (RAG): the model first queries an external index, which does store real copies, then conditions on the retrieved documents. The legally sensitive act here is the output: if it is too close to a copyrighted work, we may have reproduction or communication to the public, regardless of what happened at training.
Logs and caches. Finally, prompts and outputs may be logged; some intermediate results may be cached for efficiency or safety review. These are further places where copies live, but typically short-lived and ancillary.
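To make the separation concrete, here is a deliberately tiny sketch of those five stages in Python. Everything in it is hypothetical and radically simplified (a toy corpus, a bigram “model”, an invented file name); it is not how any production LLM is built, but it shows where verbatim copies actually exist and where only numerical parameters remain.

```python
import numpy as np

# 1. Ingestion and preprocessing: the only stage at which the works exist as
#    ordinary, verbatim copies (raw text, then token sequences).
toy_corpus = ["the quick brown fox", "the quick brown dog"]
vocab = sorted({w for doc in toy_corpus for w in doc.split()})
token_ids = [[vocab.index(w) for w in doc.split()] for doc in toy_corpus]

# 2. Training loop: transient copies pass through memory while numerical weights
#    are updated (here, simple bigram counts stand in for gradient descent).
weights = np.zeros((len(vocab), len(vocab)))
for doc in token_ids:
    for prev, nxt in zip(doc, doc[1:]):
        weights[prev, nxt] += 1.0

# 3. Model artefact: what is saved is a parameter array -- numbers, not readable
#    text. The raw corpus could now be deleted.
np.save("tiny_model.npy", weights)

# 4. Generation (inference): a prompt is turned into a probability distribution
#    over the next token, and a continuation is sampled from it.
def generate(prompt_word: str, steps: int = 3) -> str:
    out, idx = [prompt_word], vocab.index(prompt_word)
    for _ in range(steps):
        row = weights[idx]
        if row.sum() == 0:
            break
        idx = int(np.random.choice(len(vocab), p=row / row.sum()))
        out.append(vocab[idx])
    return " ".join(out)

# 5. Logs and caches: prompts and outputs may be recorded separately from the model.
interaction_log = [("the", generate("the"))]
print(interaction_log)
```

The scale is absurdly small, but the shape of the pipeline is the point: after stage 3 the only artefact is an array of numbers, and stage 4 produces text from a prompt plus probabilities, not by reading out a stored document.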
Once you draw this map, it becomes clear that GEMA is, in substance, a memorisation/output case. The evidence before the court is all about stage 4 (what ChatGPT can be induced to output), plus some assumptions about what that reveals about stage 3 (whether the model parameters “contain” those songs).
That is precisely where we need to be careful not to mistake a bug for a definition of the entire technology.
3 Memorisation and regurgitation: a bug, not the design goal
We know by now that large models sometimes memorise their training data. A line of computer-science work – from early extraction attacks on GPT-2 to more recent studies by Nicholas Carlini and colleagues – shows that, especially for short, frequent or highly regular material (such as song lyrics, code snippets or standard phrases), models can be coaxed into reproducing training examples.
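The measurement side of this literature is easy to sketch. The toy function below is written for this post rather than taken from any of the cited studies: it simply finds the longest run of consecutive words that a model output shares verbatim with a reference text, which is roughly the kind of signal extraction studies use to separate regurgitation from paraphrase.

```python
def longest_verbatim_run(output: str, reference: str) -> int:
    """Length (in words) of the longest verbatim passage shared by the two texts."""
    out, ref = output.lower().split(), reference.lower().split()
    best = 0
    for i in range(len(out)):
        for j in range(len(ref)):
            k = 0
            while i + k < len(out) and j + k < len(ref) and out[i + k] == ref[j + k]:
                k += 1
            best = max(best, k)
    return best

# Invented strings for illustration only.
reference_lyric = "some protected refrain repeated line after line after line"
model_output = "the chorus goes some protected refrain repeated line after line"
print(longest_verbatim_run(model_output, reference_lyric))  # 7 consecutive shared words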
We also know that this is a minority phenomenon. The same research finds that memorisation is:
highly concentrated in a small subset of the training data;
strongly correlated with over-representation (the more often an example appears, the more likely it is to be memorised);
extremely sensitive to prompting (most users never see it, while litigants and researchers can sometimes trigger it with adversarial prompts).
Engineers treat this as a behavioural defect to be mitigated: by de-duplicating datasets, tuning training objectives, adding regularisation, or using post-hoc alignment and safety filters. There is a fast-growing literature on “machine unlearning”—methods to remove specific data points or behaviours from a trained model without retraining from scratch.
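The first of those mitigations, de-duplication, can be illustrated in a few lines. The snippet below is a simplified stand-in: real pipelines detect near-duplicates (for instance with MinHash signatures) across billions of documents, rather than exact hashes over a handful of strings.

```python
import hashlib

def deduplicate(documents: list[str]) -> list[str]:
    """Keep one copy of each document, ignoring whitespace-only differences."""
    seen, unique = set(), []
    for doc in documents:
        key = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

corpus = [
    "some repeated lyric line",
    "some  repeated lyric line",   # a near-identical copy differing only in spacing
    "an unrelated document",
]
print(len(deduplicate(corpus)))    # 2: the over-represented text now appears only once
```

The link to memorisation is direct: because over-represented examples are the ones most likely to be memorised, removing duplicates before training attacks the problem at its source.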
From this perspective, the right lesson from the screenshots in GEMA is not “the model is a library of songs” but “this model still overfits badly on a small class of highly repeated lyrics, and its guardrails are insufficient”.
Courts, however, are being presented with carefully curated regurgitation experiments and asked to draw a much more sweeping inference: if these lyrics come out, the songs must somehow be stored in the model, and training must therefore be a massive act of reproduction. That is the slide we should resist.
The same evidentiary pattern is visible in The New York Times Company v OpenAI & Microsoft in the US, where the complaint dwells on instances in which the models output large portions of Times articles—often after the prompt has first fed in several paragraphs from the article and then asked the system to “continue” (see also OpenAI’s ‘hired gun hacker’ defense). Whatever one thinks about the merits of that case, it is important to understand that targeted regurgitation under adversarial prompting is not the same thing as continuous, systematic market substitution for ordinary readers.
4 Training as lossy compression, not hidden libraries
This brings us to the doctrinal heart of the matter: does training a model amount to a reproduction of all works in the corpus?
In EU law, the reproduction right under Article 2 InfoSoc covers the “direct or indirect, temporary or permanent” reproduction of the work, in whole or in part. Since Infopaq, we know that even 11 words can qualify if they reflect the author’s “own intellectual creation”. Pelham adds that there is no infringement if the fragment is used in a way that renders it unrecognisable—because copyright protects expression, not the mere fact of influence or inspiration.
Technically, training does involve full copies of works at the ingestion stage. It also involves transient copies in memory during the training loop. But the end product of training is a set of parameters – a high-dimensional, lossy statistical compression of correlations in the data. You cannot query the parameters to recover the lyrics of Atemlos durch die Nacht, any more than you can recover individual training images from the weights of a convolutional neural network, except in those pathological pockets of memorisation just discussed.
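A toy example, which assumes nothing about any real system, makes the lossiness tangible: two different corpora can produce exactly the same parameters, so the parameters alone cannot reproduce, or even identify, the texts they were trained on.

```python
import numpy as np

def bigram_weights(corpus: list[str], vocab: list[str]) -> np.ndarray:
    """Count bigram transitions; a stand-in for 'training' a minimal model."""
    w = np.zeros((len(vocab), len(vocab)))
    for doc in corpus:
        ids = [vocab.index(t) for t in doc.split()]
        for a, b in zip(ids, ids[1:]):
            w[a, b] += 1
    return w

vocab = ["a", "b"]
corpus_1 = ["a b", "b a"]   # two short "works"
corpus_2 = ["a b a"]        # one different "work"

# True: identical parameters, different training texts -- the weights do not
# determine, let alone store, the works behind them.
print(np.array_equal(bigram_weights(corpus_1, vocab), bigram_weights(corpus_2, vocab)))
```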
In that sense, model parameters sit in a space that copyright doctrine has always treated cautiously: they are closer to facts, statistics and functional logic than to expressive form. They belong on the “idea/technique” side of the idea–expression dichotomy.
It is striking that when the English High Court confronted similar arguments in Getty Images v Stability AI, it took a very different tack. Getty alleged, among other things, that Stability’s image models amounted to infringing copies of Getty’s training images. Mrs Justice Joanna Smith was not persuaded: she held that the model weights are not “copies” of the images for the purposes of UK copyright law, even though separate claims about the unauthorised creation and use of the training corpus remain live.
In other words, the UK court separates the pre-training acts (where real copying occurs) from the trained artefact (which it recognises as a transformed representation). GEMA does the opposite: it uses evidence of memorisation at the output stage to fold the entire process – ingestion, training, model – back into a single monolithic act of reproduction.