Imagine someone copying all of your publishing and the publishing of other authors and publishers like you, and saying they had the right to do so because they were training their AI solution — one that helps people receive the guidance they would have received from your publishing and the publishing of others like you. And imagine you’re a lawyer who’s been sharing your insight and commentary to help people, and the person copying is a legal publisher.
This is not that far-fetched under our nation’s Fair Use doctrine and the longstanding idea that people learn from existing work and create something new from what they absorbed. Think of an artist spending a day in a museum, making notes on the artwork, then walking out and creating drawings that closely follow the works in the museum, or are inspired by them.
As Alexandra Alter of the New York Times reported on Wednesday,
“Five major publishers — Hachette, Macmillan, McGraw Hill, Elsevier and Cengage — and the best-selling novelist Scott Turow have filed a class-action copyright infringement lawsuit against Meta and its founder and chief executive, Mark Zuckerberg.
The complaint, which was filed on Tuesday morning in United States District Court for the Southern District of New York, accuses Meta and Zuckerberg of illegally using millions of copyrighted works to train their artificial intelligence program Llama, and of removing copyright notices and other copyright management information from those works.
The lawsuit asserts that Meta’s engineers relied on pirated books and journal articles to train the program by downloading unlicensed copies through websites like Anna’s Archive, an open source search engine for piracy sites including LibGen and Sci-Hub. The suit also claims that “Zuckerberg himself personally authorized and actively encouraged the infringement.”
To which Meta’s spokesman told The Times,
“A.I. is powering transformative innovations, productivity and creativity for individuals and companies, and courts have rightly found that training A.I. on copyrighted material can qualify as fair use. We will fight this lawsuit aggressively.”
Seems obvious that a court would protect the copyrighted work of publishers. But not always.
Last year, two courts in the Northern District of California found that training large language models on copyrighted books was “transformative” and qualified as fair use. Both turned heavily on the first fair use factor (purpose and character of the use).
As to a legal publisher scraping the publishing of lawyers and law firms to train AI solutions, whether for research, a library, or other tools, I don’t know.
There are some sizable risks.
- The closest precedent is one that AI lost. Thomson Reuters v. Ross — a legal research competitor scraped a legal publisher’s content to train AI — went against the AI company on fair use, and went against them on the fourth factor: market harm.
- A licensing feed of such content already exists, a factor the court would consider against Fair Use.
- Damages could be huge, with a court potentially awarding damages for each work infringed.
- Worst possible plaintiffs. The contributors are lawyers. They know how to file, they have networks, and being credited and cited is the whole reason they wrote under their names.
- Scale: there are almost 200,000 lawyers in just the 200 largest law firms.
- Reputational risk in their own customer base.
The precedent that fits goes the wrong way, the market harm record is already built against you, and the plaintiff class is your own customer base.
The strategic point for you: every license you sign makes this calculus worse for anyone tempted to skip the license.
It must have been the middle of last year that our general counsel sent me an article about an AI company being sued for a huge sum for scraping a copyrighted publisher’s work. When I mentioned ‘derivative,’ he said, just don’t.
My son, Colin, with a background in journalism and digital production, has made clear from the start: we’re not scraping the work of publishers.
I think they’re right.