Does Using In-Copyright Works as Training Data Infringe?
July 18, 2025
More than forty copyright-related lawsuits have been filed in U.S. courts against developers of generative artificial intelligence (genAI) systems. The most common complaint is that the developers infringed by making copies of in-copyright materials when using them as training data for the foundation models that power their AI systems. The genAI developers’ main defense in the U.S. cases is that using those works as training data is fair use.
Bartz v. Anthropic and Kadrey v. Meta are the first two decisions in which judges have analyzed training data-related fair use defenses. The defenses succeeded to some extent, but not completely. Both judges agreed that some training data-related uses of copyrighted works may be fair uses, while other uses may not be.
The Bartz and Kadrey judges’ main disagreements were about, first, the implications of using “pirated” books as training data for foundation models, and second, a novel “market dilution” theory that genAI outputs will cause copyright markets to be flooded with AI-generated works that will unfairly compete with human-authored works, such that human authors will be disincentivized from creating new works.
A Short Primer on Copyright and Fair Uses
Copyright law protects original works of authorship by giving their authors a set of exclusive rights to control exploitations of their works, most pertinently, the right to control reproductions of their works in copies. These rights are, however, limited by U.S. law’s fair use doctrine.
Fair uses are not infringements. When assessing fair use defenses, courts typically consider four factors: 1) the purpose of the challenged use; 2) the nature of the copyrighted work (e.g., how expressive is it?); 3) the amount and substantiality of the defendant’s taking from the plaintiff’s copyrighted work; and 4) the effect of the challenged use on the market for and value of the copyrighted work. Courts analyze each factor and then weigh them together in a holistic, balanced way.
Since the Supreme Court’s 1994 decision in Campbell v. Acuff-Rose, courts have given considerable weight to whether a putative fair user has a transformative purpose, that is, a purpose different from the purpose of the plaintiff’s work.
Courts recognize that transformative purpose uses are less likely to harm markets for copyrighted works (that is, to supplant demand for the original). The rap parody of a popular song in Campbell had a transformative purpose because it was a critical commentary on the song, which made it doubtful that anyone who wanted to own the original song would buy the parody instead.
Training Data Fair Use Defenses Go to Court
Bartz and Kadrey, both book authors, are the lead plaintiffs in class action cases against, respectively, Anthropic and Meta. They claim to represent the interests of authors whose books these genAI developers have used to train, respectively, Claude and Llama.
In March 2025, Anthropic asked Judge Alsup to rule that its uses of Bartz’s books were fair uses as a matter of law. That same month, Kadrey and Meta asked Judge Chhabria to rule on Meta’s fair use defense.
In June 2025 Judge Alsup agreed with Anthropic that two of its uses of Bartz’s books were fair, but disagreed as to a third use. Judge Chhabria ruled in favor of Meta’s defense, but only because the judge thought that Kadrey had made the wrong market harm argument.
Do Training Data Uses Have Transformative Purposes?
The first issue addressed in the Bartz and Kadrey decisions was whether these developers had transformative purposes when making copies of in-copyright works to train genAI models. Both judges said the uses had transformative purposes.
Bartz and Kadrey wrote their books to educate and/or entertain their readers. Anthropic and Meta had a different purpose, namely, to statistically analyze the books’ contents as training data for their foundation models.
Both judges thought these training purposes were highly transformative. However, they cautioned that this doesn’t necessarily mean that training data uses of works will be fair uses because the other three fair use factors must be considered as well.
When challenged uses have commercial purposes, that can weigh against fair use. But this consideration typically weighs less heavily when the secondary uses have transformative purposes.
Was it Bad Faith to Use Pirated Books as Training Data?
Kadrey argued that Meta’s use of millions of pirated books to train its models was dispositive against fair use. Meta, by contrast, argued that it was irrelevant that its model was trained on pirated books. Judge Chhabria disagreed with both arguments.
The judge noted that the Supreme Court has twice suggested that what matters is whether the challenged use is objectively fair, not whether the putative fair user was a good or bad faith actor. He thought the pirated books issue did not “move the needle” on Meta’s fair use claim.
Judge Alsup’s Bartz decision, by contrast, was highly critical of Anthropic’s use of pirated books to train its models. “[P]iracy of otherwise available copies is inherently, irredeemably infringing even if the pirated copies are immediately used for the transformative use [of training models] and immediately discarded.”
What About the Nature of the Copyrighted Works?
The nature-of-the-copyrighted-work factor rarely plays a significant role in fair use cases. In general, artistic, fanciful, and other highly expressive works enjoy a “thicker” scope of copyright protection than fact-intensive or functional works, such as instruction manuals or computer programs. Highly expressive works are, as the courts put the point, closer to the “core” of copyright.
Bartz and Kadrey are authors of popular books that Anthropic and Meta used to train genAI models. Both judges thought that the nature factor cut against fair use defenses because Anthropic and Meta had chosen to use the plaintiffs’ books because of their expressiveness.
Was it Reasonable to Copy Entire Works?
Making exact copies of entire works generally weighs against fair use. However, when defendants have transformative purposes in making copies, courts usually consider whether doing so was reasonable in light of the defendant’s transformative purpose. Both judges thought that Anthropic and Meta acted reasonably in copying the entirety of plaintiffs’ works for training data purposes.
Did Training Uses Cause Authors to Lose License Revenues?
Bartz and Kadrey both claimed that the defendants’ training data uses of their books had harmed them because these developers failed to get a license to use their works as training data. In support of their market harm arguments, lawyers for both plaintiffs submitted economic expert reports that pointed to emerging markets for training data licenses.
Anthropic and Meta submitted economic expert reports to counter Bartz’s and Kadrey’s experts. Their experts opined that no such markets exist or are feasible. The defense experts pointed to the massive scale of data that genAI developers need to use to train models as sophisticated and general as Claude and Llama.
To obtain licenses from professional writers such as the plaintiffs in Bartz and Kadrey would require Anthropic and Meta to engage in millions of license transactions. Given the difficulty of assessing how much each book may have contributed to the models’ value, Anthropic and Meta argued that transaction costs and uncertain valuations make licensing infeasible.
Neither Judge Alsup nor Judge Chhabria found the lost license fee market harm theory persuasive. However, neither judge based their conclusions on the economic experts’ reports. Both judges said that training data use of books was simply a market that authors were not entitled to control. (There is case law support for the proposition that authors are not entitled to control certain kinds of markets for transformative purpose uses.)
What About Lost Sales?
Kadrey argued that Llama’s uses of his books had caused authors to lose sales. Judge Chhabria characterized this argument as a “clear loser[].” He credited Meta’s evidence that output filters prevent the model from regurgitating more than small chunks of expression from books ingested as training data.
Judge Alsup also found no such harm had occurred, since plaintiffs had conceded that Anthropic’s training copies didn’t displace demand for copies of their works in the market.
However, Judge Alsup had a different view on lost sales as to collections of pirated copies that Anthropic downloaded and stored in a database. Judge Alsup opined that Anthropic should have purchased print copies of the plaintiffs’ books to build its library and found against Anthropic on all four fair use factors for this particular use.
What About Market Dilution?
Meta won its fair use defense on the training data issue, but only because Judge Chhabria thought the plaintiffs made the wrong market effects argument. He chided Kadrey’s lawyer for having failed to present evidence that genAI outputs would harm copyright markets by diluting them.
The dilution theory posits that users of genAI systems will produce such large volumes of AI-generated works that copyright markets will be inundated with books that compete with human-authored works. As a result, human authors will be unable to make a living from creating and disseminating their works and hence will no longer be willing to create new ones.
The market dilution theory is novel and unprecedented in copyright law. During oral argument in the Bartz case, Judge Alsup called this theory “science fiction.”
Judge Chhabria said that plaintiffs in other genAI cases would likely “decisively win the fourth factor” if they, unlike Kadrey, produced evidence of market dilution. His concern was about “the rapid generation of countless works that compete with the originals, even if those works aren’t themselves infringing.”
The judge recognized that copyright cases usually focus on direct market substitution (i.e., infringing works that undermine markets for the plaintiff-author’s works). But he thought that generative AI outputs cause indirect market substitution and this should be given due weight.
Judge Chhabria speculated that certain types of works and certain types of authors would be more seriously affected than others. Memoirs, for example, might not be harmed by AI outputs, but markets for romance-genre novels, gardening books, and news articles might be swamped by AI-produced imitations. He conjectured that markets for famous writers’ books would be less likely to be harmed by AI-generated works because of the authors’ reputations for producing high quality and/or imaginative works. However, up-and-coming writers would likely find it hard to attract readers because of the huge volume of genAI outputs.
While he recognized that there is no copyright precedent that supports the market dilution theory, Judge Chhabria thought that this was because courts had never before been faced with a phenomenon akin to genAI that enabled such rapid proliferations of works into copyright markets.
This novel theory of market harm will unquestionably be challenged on appeal. In general, copyright markets are not harmed by the creation and dissemination of secondary works that do not embody substantially similar expression taken from the works on which they are based. Indeed, the law generally encourages the development of new non-infringing works. Judge Chhabria’s analysis of market dilution is built on one speculation after another.
Courts routinely find fair uses when copyright owners offer only speculative evidence of harm. (I could go on at some length criticizing this aspect of Judge Chhabria’s decision, but word limits constrain a more elaborate explanation.)
Judge Chhabria scoffed at Meta’s argument that the genAI industry would come to a halt unless developers’ fair use defenses succeeded. He expressed confidence that genAI developers would find a way to pay copyright owners for these uses. “It seems especially likely that these licensing markets will arise if LLM developers’ only choices are to get licenses or forgo the use of copyrighted books as training data.”
He signaled that courts may be reluctant to issue an injunction against genAI training. Instead, they could award some monetary relief to copyright owners for genAI uses of their works as training data.
Concluding Thoughts
Bartz and Kadrey are likely to ask appellate courts to review the rulings against their copyright claims. Anthropic may well ask for similar review of Judge Alsup’s ruling, especially as to its use of pirated data to train Claude. Appellate review will take some time, and the outcomes are uncertain.
The Bartz and Kadrey decisions are mixed bags. They indicate that some training data uses may be found to be fair uses, while others may not. The Bartz decision gives a greener light to genAI training uses of works than does the Kadrey decision. Stay tuned.
This article is due to be published in Communications of the ACM in November 2025.