Zuckerberg Knowingly Used Pirated Data to Train Meta AI, Authors Allege - Decrypt

01/10/2025 20:58

A recent court filing in an ongoing lawsuit against Meta alleges Mark Zuckerberg and other executives approved the controversial dataset despite internal warnings.

Mark Zuckerberg approved using pirated books to train Meta AI, even after his own team warned the material was illegally obtained, a group of authors allege in a recent court filing.

The allegations come from a copyright infringement lawsuit filed in a California federal court in July 2023 by a group of authors including the comedian Sarah Silverman, Christopher Golden, and Richard Kadrey. The group claimed Meta misused their books to train its Llama LLM, and they're asking for damages and an injunction to stop Meta from using their works. The judge in the case dismissed most of the authors' claims in November of that same year, but these recent allegations may breathe new life into the legal dispute.

“Meta’s CEO, Mark Zuckerberg, approved Meta’s use of the LibGen dataset notwithstanding concerns within Meta’s AI executive team (and others at Meta) that LibGen is ‘a dataset we know to be pirated,’” lawyers for the plaintiffs said in a Wednesday filing. Despite these red flags, the lawsuit alleges that, "after escalation," Zuckerberg gave the green light for Meta's AI team to proceed with using the controversial dataset.

Representatives for Meta did not immediately respond to Decrypt’s request for comment.

LibGen, short for Library Genesis, is an online platform that provides free access to books, academic papers, articles, and other written publications without properly abiding by copyright laws. It operates as a “shadow library,” offering these materials without authorization from publishers or copyright holders. It currently hosts over 33 million books and over 85 million articles.

The lawsuit alleges Meta tried to keep this under wraps until the last possible moment. Just two hours before the fact discovery deadline on December 13, 2024, the company dumped what plaintiffs describe as "some of the most incriminating internal documents it has produced to date."

Meta's own engineers seemed uncomfortable with the plan, according to statements in court filings. The group of authors alleges internal messages show Meta engineers hesitated to download the pirated material, with one noting that "torrenting from a [Meta-owned] corporate laptop doesn't feel right (smile emoji)." Nevertheless, the engineers proceeded to not only download the books but also systematically strip out copyright information to prepare them for AI training, the lawsuit claims.

The latest filings in the lawsuit paint a picture of a company fully aware of the risks: One internal memo warned that "media coverage suggesting we have used a dataset we know to be pirated, such as LibGen, may undermine our negotiating position with regulators." Yet Meta went ahead anyway, both downloading and distributing (or "seeding") the pirated content through torrenting networks by January 2024, according to the lawsuit.

When questioned about these activities in a deposition, Zuckerberg appeared to distance himself from the decision, testifying that such piracy would raise "lots of red flags" and "seems like a bad thing."

The court documents also suggest that Meta's approach to handling copyrighted information paid more attention to model training than copyright rules. According to the filing, one engineer "filtered [...] copyright lines and other data out of LibGen to prepare a CMI-stripped version of it to train Llama." This systematic removal of copyright information could strengthen the authors' claims that Meta knowingly tried to hide its use of pirated materials.

The revelations come at a crucial time for Meta's AI ambitions. The company has been pushing hard to compete with OpenAI and Google in the AI space, with Llama 3.2 being the most popular open source LLM, and Meta AI being a solid free competitor to ChatGPT with similar features.

Most of these AI companies are facing legal battles over questionable practices in training their large language models. Meta was already sued by another group of authors for copyright infringement, OpenAI is currently facing several lawsuits for training its LLMs on copyrighted material, and Anthropic is also facing accusations from authors and songwriters.

In general, tech entrepreneurs and creators have been up in arms ever since generative AI exploded in popularity. There are currently dozens of lawsuits against AI companies for knowingly using copyrighted material to train their models. But as with most things on the bleeding edge, we’ll have to wait and see what the courts have to say about it all.
