Meta AI Training: Copyright Concerns Raised in Court Filings

February 21, 2025

Meta's AI Training Practices Under Scrutiny

Internal discussions at Meta regarding the utilization of copyrighted material for training its AI models have been revealed through recently unsealed court documents. These discussions, spanning several months, suggest a deliberate strategy to leverage works acquired through potentially legally ambiguous channels.

Copyright Dispute: Kadrey v. Meta

The unsealed documents stem from the ongoing case of Kadrey v. Meta, one of many AI copyright disputes working their way through the U.S. court system. Meta maintains that training its models on IP-protected content, specifically books, constitutes “fair use.”

However, the plaintiffs – including prominent authors such as Sarah Silverman and Ta-Nehisi Coates – strongly contest this assertion.

Internal Communications Detail Strategy

Prior submissions in the lawsuit indicated that Meta CEO Mark Zuckerberg authorized the AI team to proceed with training on copyrighted materials. Furthermore, it was alleged that Meta discontinued negotiations with book publishers concerning data licensing for AI training.

The newly released filings, consisting largely of excerpts from internal Meta staff communications, provide the most detailed insight to date into how the company potentially obtained and utilized copyrighted data for its models, including those within the Llama family.

"Ask Forgiveness, Not Permission"

One particular chat conversation, involving Melanie Kambadur, a senior manager within Meta’s Llama model research team, centered on training models using works of questionable legal standing.

Xavier Martinet, a Meta research engineer, proposed a proactive approach: “My opinion would be (in the line of ‘ask forgiveness, not for permission’): we try to acquire the books and escalate it to execs so they make the call.” He further explained that the creation of the generative AI organization was intended to foster a less risk-averse environment.

Circumventing Licensing Agreements

Martinet suggested purchasing e-books at standard retail prices as a means of constructing a training dataset, rather than pursuing licensing agreements with individual publishers.

When another employee raised concerns about potential legal repercussions from using unauthorized copyrighted materials, Martinet dismissed these concerns, asserting that a substantial number of startups were likely already employing pirated books for similar purposes.

“I mean, worst case: we found out it is finally ok, while a gazillion start up [sic] just pirated tons of books on bittorrent,” Martinet wrote. He added that direct negotiations with publishers were excessively time-consuming.

Navigating Legal Approvals

Within the same conversation, Kambadur acknowledged ongoing licensing discussions with platforms like Scribd, while also noting that utilizing “publicly available data” still required approvals.

However, she indicated that Meta’s legal team had become more lenient in granting such approvals.

“Yeah we definitely need to get licenses or approvals on publicly available data still,” Kambadur stated. “Difference now is we have more money, more lawyers, more bizdev help, ability to fast track/escalate for speed, and lawyers are being a bit less conservative on approvals.”

Discussion Surrounding Libgen Usage

Internal communications detailed in the court filings show Meta personnel discussing the use of Libgen as a possible alternative to officially licensed data sources. Libgen is a “links aggregator” that offers access to copyrighted works from various publishers.

Libgen has faced numerous legal challenges, including lawsuits, shutdown orders, and substantial fines amounting to tens of millions of dollars due to copyright violations. A colleague of Kambadur responded to a proposal mentioning Libgen with a screenshot of a Google Search result, prominently displaying the disclaimer “Libgen is not legal.”

Certain Meta executives reportedly believed that abstaining from using Libgen for AI model training could significantly impede the company’s ability to compete effectively in the rapidly evolving field of artificial intelligence.

In an email to Joelle Pineau, VP of Meta AI, Sony Theakanath, a director of product management at Meta, characterized Libgen as “essential to meet SOTA numbers across all categories.” “SOTA” stands for “state of the art,” referring here to the best results achieved on AI benchmark evaluations.

Theakanath also proposed “mitigations” within the email to potentially lessen Meta’s legal risks. These included removing data from Libgen that was “clearly marked as pirated/stolen” and refraining from publicly acknowledging its use. As Theakanath stated, “We would not disclose use of Libgen datasets used to train.”

In practice, these mitigation strategies involved scrutinizing Libgen files for keywords such as “stolen” or “pirated,” according to the filings.

A work chat log shows Kambadur mentioning that Meta’s AI team also tuned its models to “avoid IP risky prompts.” This involved configuring the models to refuse requests such as “reproduce the first three pages of ‘Harry Potter and the Sorcerer’s Stone’” or “tell me which e-books you were trained on.”

The filings also suggest that Meta may have scraped Reddit data for model training, potentially by replicating the functionality of a third-party application known as Pushshift. Reddit announced in April 2023 that it intended to begin charging AI companies for access to data used in model training.

In a conversation dated March 2024, Chaya Nayak, director of product management within Meta’s generative AI division, indicated that Meta’s leadership was contemplating “overriding” previous decisions concerning training datasets. This included reconsidering a prior decision to exclude content from Quora, as well as licensed books and scholarly articles, to guarantee sufficient training data for the company’s models.

Nayak conveyed that Meta’s internally sourced training datasets – comprising Facebook and Instagram posts, transcribed text from videos on Meta platforms, and select Meta for Business messages – were inadequate. “[W]e need more data,” she wrote.

The plaintiffs in the case of Kadrey v. Meta have repeatedly amended their complaint since its initial filing in the U.S. District Court for the Northern District of California, San Francisco Division, in 2023. The most recent allegations include claims that Meta cross-referenced pirated books with their commercially licensed counterparts to assess the viability of pursuing licensing agreements with publishers.

In a sign of how high the legal stakes are, Meta has added two Supreme Court litigators from the law firm Paul Weiss to its defense team.

Meta has not yet issued a response to a request for comment on this matter.
