The legal status of AI training data is no longer an open void. A rule is forming out of the first wave of rulings and settlements, and it is more specific than either side predicted. Training a model on copyrighted material can qualify as fair use. But how you obtained that material, and whether the model’s output competes with it, can turn the same act into enormous liability. The $1.5 billion Anthropic agreed to pay is the price of getting the first half right and the second half wrong.

The ruling that drew the line

In Bartz v. Anthropic, Judge William Alsup issued a split decision that is now the template. He found that training an AI on copyrighted books can be transformative fair use, but downloading and storing pirated copies to build the training library is not. After that ruling, the case settled for $1.5 billion, roughly $3,000 per work, one of the largest copyright settlements in US history. The act of learning from the books was defensible. The act of pirating them to do it was not, and that distinction carried a ten-figure price.
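As a back-of-the-envelope check, the two rounded figures above imply a class on the order of half a million works. The sketch below is only the arithmetic implied by those rounded numbers, not a count taken from the settlement documents; the variable names are illustrative.

```python
# Rounded figures quoted in the paragraph above.
settlement_usd = 1_500_000_000  # total settlement
per_work_usd = 3_000            # approximate payout per work

# Implied number of works covered, given the rounded inputs.
implied_works = settlement_usd // per_work_usd
print(f"{implied_works:,} works")  # 500,000 works
```

Because both inputs are approximate, the result is best read as a scale, not a precise tally.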

The same fault line runs through the other cases. In Kadrey v. Meta, the court accepted fair use for training the model but left alive the claims about pirated works acquired through torrenting. Where plaintiffs can show the model reproduces their work or that the corpus was pirated, the transformative defense weakens fast.

Why the blanket fair-use defense is failing

Labs leaned on a single argument: training is transformative because the model learns patterns rather than storing works. Courts have been skeptical for three concrete reasons. First, plaintiffs in NYT v. OpenAI presented examples of the model reproducing substantial verbatim portions of their articles, which undercuts “transformative”; the Bartz case, by contrast, turned on how the training corpus was acquired rather than on any infringing output. Second, the use is commercial and generates billions, which weighs against fair use. Third, the output competes in the same market as the source. The US Copyright Office sharpened that last point in a May 2025 report: when AI training competes with existing licensing markets for the original works, the analysis tilts against the AI company.

The emerging rule: provenance and competition

Defensible: Training on lawfully acquired or licensed data, where output does not reproduce the source.

Liability: Pirated or unlicensed source material, or output that reproduces and competes with the original.

Source: Bartz v. Anthropic ruling; US Copyright Office report (2025).

The fight is going global, and toward licensing

This is no longer a US-only story. In November 2025 the Munich Regional Court held OpenAI liable in GEMA v. OpenAI for memorizing and reproducing copyrighted song lyrics, among the first major European rulings against an AI developer on training data. The UK High Court, by contrast, rejected a secondary-copyright claim in Getty v. Stability. NYT v. OpenAI grinds on and is widely expected to settle, with plaintiffs building evidence that pirated books were downloaded from the LibGen shadow library and stored, a theory that mirrors Bartz. With more than 70 copyright suits now filed against AI companies, the industry response is shifting from litigation to licensing: Warner Music’s suit against Udio resolved into a partnership rather than a verdict.

The altitude shift

Follow a single engineering shortcut to its invoice. Years ago, pulling a corpus of books or articles off a torrent was a convenience, a way to get training data fast and free, and nobody priced the risk. Today that same shortcut is a class action with statutory damages, and in Anthropic’s case it resolved at roughly $3,000 for every book in the pile. The model that learned from the text was on solid legal ground. The decision to acquire the text by pirating it is what converted a research practice into a balance-sheet event. The law did not ban training. It priced the shortcut, and the price is high enough to change how the next corpus gets built.

The rule to operate by

If you build models, the era of “it was publicly available” as a defense is closing, the same way the broader rulebook that already binds model builders stopped being optional. Provenance is now a legal feature of your training set, not a footnote, and licensed or lawfully acquired data is the only version that survives a verbatim-reproduction challenge. If you buy or fine-tune someone else’s model, training-data provenance belongs on your diligence checklist next to security, uptime, and the political risk every AI roadmap now carries, because the liability can attach downstream. Ask the vendor where the data came from. If the honest answer is “the open internet,” you have found a risk, not a reassurance.
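For teams that want to make this operational, one way to treat provenance as a feature of the training set is to record how each item was acquired and flag anything without a lawful basis before it reaches a training run. The sketch below is illustrative only: the field names, the acquisition categories in `LAWFUL`, and the `flag_risky` helper are assumptions made for this example, not a legal standard or any vendor's schema.

```python
from dataclasses import dataclass

# Hypothetical acquisition categories treated as lawful for this sketch;
# a real program would define these with counsel.
LAWFUL = {"licensed", "purchased", "public_domain"}


@dataclass(frozen=True)
class CorpusItem:
    title: str
    acquisition: str       # how the copy was obtained, e.g. "licensed" or "torrent"
    license_ref: str = ""  # pointer to a contract or receipt, if any


def flag_risky(items):
    """Return items with an unrecognized acquisition route or no paper trail."""
    risky = []
    for item in items:
        if item.acquisition not in LAWFUL:
            risky.append(item)
        elif item.acquisition != "public_domain" and not item.license_ref:
            # Claimed as licensed or purchased, but nothing on file to prove it.
            risky.append(item)
    return risky


# Made-up sample data for illustration.
corpus = [
    CorpusItem("Novel A", "licensed", "contract-001"),
    CorpusItem("Novel B", "torrent"),
    CorpusItem("Essay C", "open_internet"),
    CorpusItem("Classic D", "public_domain"),
]

for item in flag_risky(corpus):
    print(f"review before training: {item.title} ({item.acquisition})")
```

On the sample data this flags "Novel B" and "Essay C". The design choice is to treat anything outside a known-lawful list as a risk by default, which matches the point that "the open internet" is a question to investigate, not an answer.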


Edited by Aditya Marin Gasga


Frequently asked questions

Is it legal to train AI on copyrighted material?

It can be. In Bartz v. Anthropic, a court found training on copyrighted books can be transformative fair use, but acquiring and storing pirated copies to build the training set is not. The legality turns heavily on how the data was obtained and whether the output reproduces or competes with the original.

Why did Anthropic pay $1.5 billion?

The settlement, roughly $3,000 per work and among the largest copyright settlements in US history, resolved claims that Anthropic downloaded and stored pirated books as training data. The court had found that piracy fell outside fair use even though training itself could qualify.

What is the test courts are applying?

They weigh whether the use is genuinely transformative, whether the model reproduces substantial portions of the work, whether the use is commercial, and whether it competes with the original's market. A US Copyright Office report notes that competing with existing licensing markets tilts the analysis against the AI company.

What does this mean for companies using AI?

Training-data provenance is now a diligence question. Builders should use licensed or lawfully acquired data, and buyers of AI models should ask vendors where training data came from, since liability for unlicensed data can carry downstream.

About Aditya Marin Gasga

Founding Editor

Aditya Marin Gasga is the founding editor of The Counter Brief and Head of Growth at Demand Nexus, its parent company, where he works on sourcing qualified pipeline across SDR, content, and paid channels. His background is in performance marketing and demand generation. He studied business administration at Northumbria University.
