OpenAI AI Models Trained on Paywalled O'Reilly Books, Research Suggests

Allegations of Copyrighted Material Use in OpenAI's AI Training
OpenAI has faced numerous accusations of training its artificial intelligence models on copyrighted material without proper authorization. A recent study by an AI watchdog organization makes a pointed claim: the company increasingly relied on non-public books, for which it held no licensing agreements, to develop its more advanced AI systems.
How AI Models Learn
AI models are, at their core, prediction systems. They are trained on extensive datasets – books, films, television programs, and more – and learn patterns and ways of extrapolating from a simple prompt.
When a model generates text, such as an essay on a classical Greek play, or creates an image in the style of Studio Ghibli, it is drawing on that vast store of training data to produce an approximation; it is not creating anything genuinely new.
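To make that concrete, here is a minimal sketch of next-token prediction using the small, openly available GPT-2 model via the Hugging Face transformers library. Both are chosen purely for illustration (the models discussed in this article are accessible only through OpenAI's API), and the prompt is an invented example.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load a small open model and its tokenizer for illustration purposes.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Oedipus Rex is a tragedy about"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# The model assigns a probability to every possible next token; generation is
# just repeated sampling from (or picking the top of) this distribution.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{tokenizer.decode([token_id])!r}: {prob:.3f}")
```

Generating longer text is simply this step repeated, which is why a model's output so closely mirrors the material it was trained on.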
The Shift Towards Synthetic Data
While several AI laboratories, including OpenAI, have begun incorporating AI-generated data into their training processes as they exhaust readily available real-world sources – primarily from the public web – few have completely abandoned the use of real-world data.
This is likely due to the inherent risks associated with training solely on synthetic data, including the potential for diminished model performance.
The AI Disclosures Project Findings
The new paper, released by the AI Disclosures Project – a nonprofit established in 2024 by media executive Tim O’Reilly and economist Ilan Strauss – suggests that OpenAI likely trained its GPT-4o model on paywalled books from O’Reilly Media. Notably, O’Reilly is also the CEO of O’Reilly Media.
GPT-4o currently functions as the default model within ChatGPT. The paper indicates that OpenAI does not possess a licensing agreement with O’Reilly for these materials.
Comparative Analysis of GPT-4o and GPT-3.5 Turbo
“GPT-4o, OpenAI’s latest and most capable model, exhibits a notable ability to recognize content from O’Reilly books that are behind a paywall… in comparison to OpenAI’s earlier model, GPT-3.5 Turbo,” the paper’s authors stated.
Conversely, GPT-3.5 Turbo demonstrates a greater recognition of publicly accessible samples from O’Reilly books.
The DE-COP Method
The research employed a technique known as DE-COP, initially presented in an academic study in 2024. This method is designed to identify copyrighted content within the training data of language models.
Also referred to as a “membership inference attack,” the technique assesses whether a model can consistently differentiate between texts authored by humans and paraphrased, AI-generated versions of the same text. Successful differentiation suggests the model may have prior knowledge of the text from its training data.
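To illustrate the idea, the sketch below runs a single multiple-choice trial of this kind against the OpenAI API using the official openai Python client. The function name decop_trial, the prompt wording, and the assumption that paraphrases have already been generated are all illustrative; the study's actual prompts and scoring are more elaborate.

```python
import random
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def decop_trial(original: str, paraphrases: list[str], model: str = "gpt-4o") -> bool:
    """Return True if the model picks the verbatim excerpt out of the options."""
    options = paraphrases + [original]
    random.shuffle(options)
    answer_index = options.index(original)

    labeled = "\n".join(f"{i + 1}. {text}" for i, text in enumerate(options))
    prompt = (
        "One of the numbered passages below is quoted verbatim from a published book; "
        "the others are paraphrases of it. Reply with the number of the verbatim "
        "passage and nothing else.\n\n" + labeled
    )
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    guess = reply.choices[0].message.content.strip()
    return guess.startswith(str(answer_index + 1))
```

Over many excerpts, accuracy well above chance suggests the model has seen the verbatim text before, rather than merely knowing the topic.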
Testing OpenAI Models
The authors – O’Reilly, Strauss, and AI researcher Sruly Rosenblat – evaluated the knowledge of GPT-4o, GPT-3.5 Turbo, and other OpenAI models regarding O’Reilly Media books published both before and after their respective training cutoff dates.
They analyzed 13,962 paragraph excerpts from 34 O’Reilly books to estimate the probability that a specific excerpt was included in a model’s training dataset.
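A simplified way to aggregate such per-excerpt trials into a per-book signal might look like the sketch below. It assumes four answer options per quiz (so chance accuracy is 25%) and reuses the hypothetical decop_trial outcomes from above; the published study instead reports AUROC scores and applies additional controls.

```python
from statistics import mean

CHANCE_ACCURACY = 0.25  # one verbatim passage among four options

def recognition_rate(outcomes: list[bool]) -> float:
    """Fraction of excerpts on which the model picked the verbatim passage."""
    return mean(outcomes)

# Hypothetical usage: outcomes_by_book maps a book title to the per-excerpt
# results of decop_trial() from the sketch above.
outcomes_by_book = {
    "Some Paywalled O'Reilly Title": [True, False, True, True],
}
for title, outcomes in outcomes_by_book.items():
    rate = recognition_rate(outcomes)
    print(f"{title}: {rate:.0%} recognized vs. {CHANCE_ACCURACY:.0%} chance")
```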
Results and Implications
The study’s findings indicate that GPT-4o “recognized” a significantly larger amount of paywalled O’Reilly book content compared to OpenAI’s older models, specifically GPT-3.5 Turbo. This observation remained consistent even after accounting for potential confounding variables, such as improvements in newer models’ ability to identify human-authored text.
“GPT-4o [likely] recognizes, and therefore possesses prior knowledge of, numerous non-public O’Reilly books published prior to its training cutoff date,” the authors concluded.
Caveats and Further Research
The authors emphasize that their findings do not constitute definitive proof. They acknowledge the limitations of their experimental methodology and the possibility that OpenAI may have acquired the paywalled book excerpts through user input – such as copying and pasting into ChatGPT.
Furthermore, the study did not evaluate OpenAI’s most recent models, including GPT-4.5 and “reasoning” models like o3-mini and o1. It remains possible that these models were not trained on paywalled O’Reilly book data, or were trained on a smaller quantity.
OpenAI's Data Acquisition Strategies
It is well-known that OpenAI, a proponent of more flexible regulations regarding the use of copyrighted data for model development, has been actively seeking higher-quality training data. The company has even hired journalists to refine the outputs of its models.
This trend is widespread within the industry, with AI companies increasingly recruiting experts in various fields – such as science and physics – to effectively integrate their knowledge into AI systems.
Existing Licensing Agreements
It is important to note that OpenAI does compensate for at least some of its training data. The company maintains licensing agreements with news publishers, social networks, stock media libraries, and other entities.
OpenAI also provides opt-out mechanisms – although imperfect – allowing copyright holders to request that their content not be used for training purposes.
Legal Challenges and Ongoing Scrutiny
With OpenAI facing multiple lawsuits in U.S. courts over its training-data practices and its interpretation of copyright law, the O’Reilly paper does not cast the company in a favorable light.
OpenAI did not respond to a request for comment.