Meta's Llama AI Trained on Copyrighted Works With Zuckerberg's Approval, Filing Alleges

January 9, 2025
Allegations Surface Regarding Meta's AI Training Data

Counsel for the plaintiffs in a copyright lawsuit contend that Mark Zuckerberg, CEO of Meta, authorized the use of a dataset composed of illegally obtained e-books and articles to train the company’s Llama AI models.

Copyright Claims Against Tech Companies

The lawsuit, Kadrey v. Meta, is among a growing number of cases brought against technology firms involved in AI development. These suits allege that the companies trained AI models on copyrighted material without authorization. Defendants, including Meta, have largely relied on a fair use defense, asserting that their use of copyrighted works is sufficiently transformative. Many content creators dispute this argument.

Zuckerberg's Approval of LibGen Dataset

Recent filings in the Kadrey v. Meta case, presented to the U.S. District Court for the Northern District of California, reveal testimony from Meta indicating that Zuckerberg approved the use of a dataset known as LibGen for training purposes related to the Llama models. This information came to light late last year.

What is LibGen?

LibGen describes itself as a “links aggregator” and offers access to copyrighted works from prominent publishers such as Cengage Learning, Macmillan Learning, McGraw Hill, and Pearson Education. The platform has faced multiple lawsuits, shutdown orders, and fines totaling tens of millions of dollars for copyright infringement.

Internal Concerns at Meta

According to Meta’s testimony, as reported by the plaintiffs’ legal team, Zuckerberg sanctioned the use of LibGen despite internal reservations within Meta’s AI division and among other company personnel. Meta employees reportedly described LibGen as a “data set we know to be pirated” and expressed concerns that its use “may undermine [Meta’s] negotiating position with regulators.”

Approval Process and Internal Communication

A memo addressed to Meta AI decision-makers indicates that after “escalation to MZ” (a clear reference to Mark Zuckerberg), the AI team received approval to use LibGen. This suggests a deliberate decision made at the highest levels of the company.

Previous Reporting and Data Acquisition Strategies

These details align with prior reporting from The New York Times, which indicated Meta explored various methods to acquire data for its AI initiatives. The company reportedly considered hiring contractors in Africa to summarize books and even contemplated purchasing Simon & Schuster. However, executives determined that negotiating licenses would be too time-consuming and believed fair use provided a strong legal defense.

Allegations of Data Manipulation

The recent filing introduces new accusations, suggesting Meta may have attempted to conceal its alleged infringement by removing attribution from the LibGen data.

Removing Copyright Information

Plaintiffs’ counsel states that Meta engineer Nikolay Bashlykov developed a script to eliminate copyright information, including terms like “copyright” and “acknowledgments,” from e-books sourced from LibGen. Furthermore, Meta allegedly removed copyright markers and “source metadata” from science journal articles used in Llama’s training data.

Concealing Infringement

The filing argues that this action shows Meta intended not only to use the data for training but also to conceal its copyright infringement. Stripping copyright information from the works, the filing asserts, prevents Llama from generating output that might reveal Meta’s unlawful practices.

Torrenting and Data Distribution

The latest filing also reveals that Meta admitted during depositions to torrenting LibGen, a practice that raised concerns among some Meta research engineers. Torrenting, a method of file distribution, requires users to simultaneously upload the files they are downloading.

Facilitating Copyright Violation

Plaintiffs’ counsel contends that Meta actively participated in copyright infringement by torrenting LibGen and thereby contributing to the spread of its copyrighted content. They further allege that Meta attempted to obscure its activities by limiting the number of files it uploaded.

AI Head Cleared Torrenting

According to the filing, Ahmad Al-Dahle, Meta’s head of generative AI, “cleared the path” for torrenting LibGen, dismissing concerns raised by Bashlykov regarding the potential legal ramifications.

Illegal Acquisition vs. Lawful Methods

“Had Meta bought plaintiffs’ works in a bookstore or borrowed them from a library and trained its Llama models on them without a license, it would have committed copyright infringement,” plaintiffs’ counsel stated in the filing. “Meta’s decision to bypass lawful methods of acquiring books and become a knowing participant in an illegal torrenting network … serves as proof of copyright infringement.”

Case Status and Fair Use Argument

The outcome of the case remains uncertain. Currently, it focuses on Meta’s earlier Llama models and does not encompass its recent releases. The court may ultimately rule in Meta’s favor if it finds the company’s fair use argument persuasive. (A previous court dismissed similar AI-related copyright claims against Meta in 2023, citing a failure to demonstrate infringement.)

Judge's Concerns Regarding Transparency

However, the allegations cast a negative light on Meta, as noted by Judge Vince Chhabria, who is presiding over the case. In an order issued Wednesday, Chhabria rejected Meta’s request to redact portions of the filing.

Avoiding Negative Publicity

Chhabria wrote that Meta’s sealing request was not intended to protect sensitive business information from competitors but rather to “avoid negative publicity.”

We have contacted Meta’s PR department for a statement and will update this article upon receiving a response.