The copyright war just got messier—turns out proving AI theft requires catching the thief red-handed with your stuff, and most of the stealing happens in the dark.

The Summary

  • A 2023 lawsuit by Sarah Silverman and other authors against OpenAI was partially dismissed because the plaintiffs couldn't point to specific outputs that directly copied their work. Showing that a model trained on your books isn't enough in court.
  • Legal harm in copyright cases requires demonstrating that AI outputs compete with or replace the original creator's business, not just that scraping occurred.
  • A shadow industry of AI scraping operates at scale through bots, creating outputs that never surface in public-facing products like ChatGPT or Perplexity, making proof of harm nearly impossible to establish.

The Signal

The Silverman case exposed a structural flaw in how copyright law meets generative AI. The judge's partial dismissal did not rest on a finding that OpenAI never trained on copyrighted books. It came because the authors couldn't point to specific ChatGPT outputs that directly reproduced their work. Training on copyrighted material, by itself, wasn't enough to carry the dismissed claims. You need to show the model is spitting out something that competes with your original work.

This creates a near-impossible burden for most creators. If you're a novelist and an LLM was trained on your books, you'd need to query the model thousands of times, hoping it generates passages similar enough to your own to support an infringement claim. And even then, public-facing models are typically tuned to avoid direct reproduction. They're engineered to stay just transformative enough to dodge the copyright hammer.
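
To make that burden concrete, here is a rough sketch of the kind of check a creator would have to run over thousands of collected model outputs. It is an illustration only: the function names, the 8-word window, and the 0.5 threshold are all assumptions chosen for the example, and word-for-word overlap is not the legal test for infringement. It simply measures how much of each output repeats runs of words from the source text.

```python
from typing import List, Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase n-word sequences in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_score(source: str, output: str, n: int = 8) -> float:
    """Fraction of the output's n-grams that also appear in the source.

    n=8 is an arbitrary illustrative window, not a legal standard.
    """
    out_grams = word_ngrams(output, n)
    if not out_grams:
        return 0.0  # output shorter than n words: nothing to compare
    return len(out_grams & word_ngrams(source, n)) / len(out_grams)


def flag_outputs(source: str, outputs: List[str], n: int = 8,
                 threshold: float = 0.5) -> List[Tuple[int, float]]:
    """Return (index, score) for outputs at or above the threshold.

    The threshold is a hypothetical cutoff for manual review.
    """
    flagged = []
    for i, out in enumerate(outputs):
        score = overlap_score(source, out, n)
        if score >= threshold:
            flagged.append((i, score))
    return flagged
```

Even with a tool like this, the check only works on outputs you can actually collect, and it misses paraphrase entirely, which is the article's point about how hard the evidence is to gather.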

"Scraping content without permission may be detestable, but if the party doing the scraping isn't doing anything with it that would compete with the content creator, it's difficult to prove harm."

Now layer in the shadow scraping economy. Public AI products like ChatGPT are just the visible tip. Behind them sits an entire infrastructure of data brokers, fine-tuning services, and private model builders scraping content at industrial scale. These operations never surface in consumer products. They sell cleaned datasets, embeddings, or custom models to enterprise clients. The outputs never see daylight, which means copyright holders have zero visibility into whether their work is being used, reproduced, or monetized.

This isn't theoretical. Data brokers are packaging scraped news articles, blog posts, and research papers into training sets sold to companies building internal AI tools. A regional bank might buy a dataset scraped from financial news sites to train a customer service bot. That bot never outputs full articles, so the news organizations have no way to detect the infringement. No public output means no evidence. No evidence means no case.

Key dynamics at play:

  • Copyright law wasn't designed for probabilistic systems that remix rather than copy
  • The burden of proof falls on creators who lack access to the black-box systems using their work
  • Shadow scraping operations purposely avoid creating discoverable outputs

The legal system is structured around the idea that infringement leaves a trail. You copy a song, someone hears it. You reprint an article, someone sees it. But generative AI breaks that model. The "copy" exists as statistical weights in a neural network, and the output is a synthesized blend that is hard to tie back to any single source. Proving harm requires finding a smoking gun in a system designed to never leave fingerprints.

The Implication

If you create content professionally, understand that current copyright law offers almost no protection against AI scraping unless you can prove direct competitive harm from specific outputs. That's a nearly impossible standard when most scraping feeds private systems you'll never see. The strategic move isn't waiting for legal clarity. It's assuming your work is being scraped right now and building accordingly.

For publishers and platforms, the focus should shift from trying to stop scraping (you can't) to building licensing infrastructure that makes paid access cheaper and easier than adversarial scraping. For individual creators, the play is building direct relationships and distribution channels that AI can't easily replicate—community, curation, real-time interaction, paid access tiers. The law isn't coming to save you. The machines are already here, and they're eating in the dark.

Sources

Fast Company Tech