The training data question just got a price tag, and Meta's betting it won't stick.

The Summary

  • Five major publishers and author Scott Turow sued Meta for allegedly scraping books and journals from pirate sites like LibGen and Sci-Hub to train Llama models without permission
  • Publishers claim Meta "engaged in one of the most massive infringements of copyrighted materials in history" by knowingly using pirated content at scale
  • This lawsuit draws a direct line from AI training to piracy infrastructure, making the provenance of training data a central legal battleground

The Signal

Meta's being sued by Macmillan, McGraw-Hill, Elsevier, Hachette, Cengage, and novelist Scott Turow in a class action that cuts straight to the question every AI company has been dodging: where exactly did your training data come from? The publishers aren't alleging accidental infringement or gray-area fair use. They're claiming Meta deliberately pulled copyrighted books and academic journals from notorious pirate sites and fed them into Llama.

The named sources read like a piracy hall of fame: LibGen, Anna's Archive, Sci-Hub, Sci-Mag. These aren't random torrent sites. They're massive repositories that academic researchers use when their institutions won't pay Elsevier's ransom prices, and that readers use when a book costs $40 in hardcover. Meta allegedly treated them as a free all-you-can-eat training buffet.

"Meta engaged in one of the most massive infringements of copyrighted materials in history."

Here's what makes this different from earlier AI copyright suits: specificity. Previous cases against OpenAI and others have turned on whether training constitutes fair use, whether models "memorize" content, and whether outputs infringe. This suit claims Meta knowingly copied complete works from illegal sources. If that allegation holds, it's not a transformative use argument. It's just theft with extra steps.

Key escalation points:

  • Publishers are going after the training process itself, not just model outputs
  • They're naming specific pirate infrastructure Meta allegedly used
  • This is a class action, meaning potential exposure scales with every class member's copyrighted work found in Llama's training corpus

The timing matters. Meta's been positioning Llama as the open-weight alternative to closed models, arguing that transparency and researcher access justify giving away models other companies charge for. But if the foundation is pirated content, "open" starts to look like "laundering at scale." You can't open-source someone else's property.

For publishers, this is existential. If AI companies can train on pirated copies of every book ever written, then sell or give away models that can summarize, analyze, or remix that knowledge, what's left to sell? The value isn't in the physical book anymore. It's in the information, and that just got compressed into weights and biases.

The Implication

If publishers win, every AI company will need to audit its training data provenance or start writing checks. Google, Anthropic, and OpenAI have all been vague about their sources for a reason. This lawsuit could force disclosure or establish that training on pirated content isn't defensible as fair use, even if you delete the originals afterward.

For anyone building agents or betting on open models: watch what happens to Llama. If Meta settles or loses, the cost of "open" AI goes up. That might shift power back toward companies that can afford licensing deals, or it might finally force the industry to build synthetic training pipelines that don't depend on scraping the entire internet, pirate sites included. Either way, the free lunch is ending.

Sources

The Verge AI