OpenAI Models and Copyright: Study Reveals Memorization of Copyrighted Content

New Research Suggests OpenAI Models May Retain Copyrighted Material
A new study appears to lend support to claims that OpenAI trained at least some of its artificial intelligence models on copyrighted material.
Legal Challenges and Fair Use Debate
OpenAI currently faces lawsuits from authors, software developers, and other rights holders, who allege that their creative works – books, code repositories, and similar assets – were used to build OpenAI’s models without permission.
OpenAI argues that its use of this material qualifies as fair use; the plaintiffs counter that U.S. copyright law provides no exemption for data used to train AI systems.
A Novel Method for Detecting Memorization
Researchers from the University of Washington, the University of Copenhagen, and Stanford University collaborated on the study. They introduced a new technique designed to identify instances where models, accessed through an API like OpenAI’s, have “memorized” specific training data.
AI models function as predictive systems. Through exposure to extensive datasets, they discern patterns, enabling them to generate text, images, and other content. Although most outputs aren't direct copies, the learning process can sometimes result in verbatim reproduction of training data.
Image models have been caught reproducing screenshots from films in their training data, and language models have effectively plagiarized passages from news articles.
Identifying “High-Surprisal” Words
The study’s methodology centers on identifying “high-surprisal” words – terms that are statistically unusual within a given context.
For example, the word “radar” in the phrase “Jack and I sat perfectly still with the radar humming” would be considered high-surprisal. This is because words like “engine” or “radio” are more commonly expected to precede “humming.”
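To illustrate the idea (this is not the study’s own code), the sketch below scores each word in a sentence by its surprisal, the negative log-probability a reference language model assigns to the word given the preceding text, and flags words above a threshold as high-surprisal. The choice of GPT-2 as the reference model and the 10-nat threshold are illustrative assumptions.

```python
# Minimal sketch of flagging "high-surprisal" words with a small open model.
# Assumptions: GPT-2 as the scoring model, surprisal measured in nats,
# and an arbitrary 10-nat threshold for "high-surprisal".
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def word_surprisals(sentence: str) -> list[tuple[str, float]]:
    """Return (word, surprisal in nats) for each word after the first."""
    results = []
    words = sentence.split()
    for i in range(1, len(words)):
        context = " ".join(words[:i])
        target = " " + words[i]  # leading space: GPT-2 tokenizes mid-sentence words with it
        ctx_ids = tokenizer(context, return_tensors="pt").input_ids
        tgt_ids = tokenizer(target, return_tensors="pt").input_ids
        full_ids = torch.cat([ctx_ids, tgt_ids], dim=1)
        with torch.no_grad():
            logits = model(full_ids).logits
        log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
        # Sum -log p over the target word's tokens, conditioned on what precedes them.
        surprisal = 0.0
        for j, tok in enumerate(tgt_ids[0]):
            pos = ctx_ids.shape[1] - 1 + j  # logits at pos predict the token at pos + 1
            surprisal += -log_probs[pos, tok].item()
        results.append((words[i], surprisal))
    return results

for word, s in word_surprisals("Jack and I sat perfectly still with the radar humming"):
    flag = "  <- high-surprisal?" if s > 10.0 else ""  # illustrative threshold
    print(f"{word:>10s}  {s:6.2f}{flag}")
```

Under this kind of scoring, “radar” should receive a noticeably higher surprisal than a more expected word such as “engine” or “radio” in the same slot.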
Probing OpenAI’s Models
The researchers probed several OpenAI models, including GPT-4 and GPT-3.5, for signs of memorization. They masked the high-surprisal words in excerpts from fiction books and New York Times articles, then asked the models to predict the missing words.
If a model consistently guessed the masked words correctly, the researchers concluded it had likely memorized the excerpt during training.
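As a rough sketch of that kind of probe (not the researchers’ actual prompts or scoring), the snippet below masks a single high-surprisal word in an excerpt, asks an OpenAI model over the API to restore it, and treats an exact match as weak evidence of memorization. The prompt wording, model name, and matching rule are illustrative assumptions.

```python
# Minimal sketch of a masked-word probe against an API-served model.
# Assumptions: the prompt, the "gpt-4" model name, and exact-match scoring
# are illustrative, not the study's protocol.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def probe_masked_word(excerpt: str, masked_word: str, model: str = "gpt-4") -> bool:
    """Ask the model to restore a single [MASK]ed word; return True on exact match."""
    masked = excerpt.replace(masked_word, "[MASK]", 1)
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Fill in the [MASK] in the passage with the single "
                        "original word. Reply with only that word."},
            {"role": "user", "content": masked},
        ],
    )
    guess = response.choices[0].message.content.strip().strip(".\"'").lower()
    return guess == masked_word.lower()

excerpt = "Jack and I sat perfectly still with the radar humming"
print(probe_masked_word(excerpt, "radar"))  # a correct guess hints at memorization
```

A single correct guess proves little on its own; a probe like this only becomes meaningful when aggregated over many masked excerpts and compared against text the model could not have seen.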
Results Indicate Memorization of Copyrighted Works
The test results revealed that GPT-4 exhibited evidence of memorizing portions of popular fiction books. This included material from BookMIA, a dataset containing samples of copyrighted e-books.
The model also demonstrated memorization of segments from New York Times articles, although at a lower frequency.
The Need for Transparency and Auditability
Abhilasha Ravichander, a doctoral student at the University of Washington and a co-author of the study, told TechCrunch that the findings shed light on the “contentious data” these models may have been trained on.
“To foster trust in large language models, we require the ability to probe, audit, and scientifically examine them,” Ravichander stated. “Our research provides a tool for probing these models, but greater data transparency across the entire ecosystem is crucial.”
OpenAI’s Position on Copyright and AI Training
OpenAI has consistently advocated for relaxed regulations concerning the use of copyrighted data in model development.
While the company has established some content licensing agreements and offers opt-out options for copyright holders who wish to prevent their content from being used for training, it has also lobbied governments to formally recognize “fair use” principles in the context of AI training practices.
These efforts aim to establish a legal framework that supports continued innovation in the field of artificial intelligence.