OpenAI AI Models Trained on Paywalled O'Reilly Books, Research Suggests

Allegations of Copyrighted Material Use in OpenAI's AI Training
OpenAI has faced numerous accusations of training its artificial intelligence models on copyrighted material without proper authorization. A recent study by an AI watchdog organization makes a pointed claim: the company increasingly relied on non-public books, for which it held no licensing agreements, to develop its more advanced AI systems.
How AI Models Learn
AI models are, at their core, prediction systems. They are trained on extensive datasets – books, films, television programs, and more – and learn patterns and ways of extrapolating from a simple prompt.
When a model generates text, such as an essay on a classical Greek play, or creates an image in the style of Studio Ghibli, it is drawing on that vast store of training data to produce an approximation; it is not creating anything genuinely new.
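To make that concrete, here is a minimal sketch of next-token prediction using the small, openly available GPT-2 model via the Hugging Face transformers library. Both are chosen purely for illustration (the models discussed in this article are accessible only through OpenAI's API), and the prompt is an invented example.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load a small open model and its tokenizer for illustration purposes.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Oedipus Rex is a tragedy about"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# The model assigns a probability to every possible next token; generation is
# just repeated sampling from (or picking the top of) this distribution.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{tokenizer.decode([token_id])!r}: {prob:.3f}")
```

Generating longer text is simply this step repeated, which is why a model's output so closely mirrors the material it was trained on.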
The Shift Towards Synthetic Data
While several AI laboratories, including OpenAI, have begun incorporating AI-generated data into their training processes as they exhaust readily available real-world sources – primarily from the public web – few have completely abandoned the use of real-world data.
This is likely due to the inherent risks associated with training solely on synthetic data, including the potential for diminished model performance.
The AI Disclosures Project Findings
The new paper, released by the AI Disclosures Project – a nonprofit established in 2024 by media executive Tim O’Reilly and economist Ilan Strauss – suggests that OpenAI likely trained its GPT-4o model on paywalled books from O’Reilly Media. Notably, O’Reilly is also the CEO of O’Reilly Media.
GPT-4o currently functions as the default model within ChatGPT. The paper indicates that OpenAI does not possess a licensing agreement with O’Reilly for these materials.
Comparative Analysis of GPT-4o and GPT-3.5 Turbo
“GPT-4o, OpenAI’s latest and most capable model, exhibits a notable ability to recognize content from O’Reilly books that are behind a paywall… in comparison to OpenAI’s earlier model, GPT-3.5 Turbo,” the paper’s authors stated.
Conversely, GPT-3.5 Turbo demonstrates a greater recognition of publicly accessible samples from O’Reilly books.
The DE-COP Method
The research employed a technique known as DE-COP, initially presented in an academic study in 2024. This method is designed to identify copyrighted content within the training data of language models.
Also referred to as a “membership inference attack,” the technique assesses whether a model can consistently differentiate between texts authored by humans and paraphrased, AI-generated versions of the same text. Successful differentiation suggests the model may have prior knowledge of the text from its training data.
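To illustrate the idea, the sketch below runs a single multiple-choice trial of this kind against the OpenAI API using the official openai Python client. The function name decop_trial, the prompt wording, and the assumption that paraphrases have already been generated are all illustrative; the study's actual prompts and scoring are more elaborate.

```python
import random
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def decop_trial(original: str, paraphrases: list[str], model: str = "gpt-4o") -> bool:
    """Return True if the model picks the verbatim excerpt out of the options."""
    options = paraphrases + [original]
    random.shuffle(options)
    answer_index = options.index(original)

    labeled = "\n".join(f"{i + 1}. {text}" for i, text in enumerate(options))
    prompt = (
        "One of the numbered passages below is quoted verbatim from a published book; "
        "the others are paraphrases of it. Reply with the number of the verbatim "
        "passage and nothing else.\n\n" + labeled
    )
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    guess = reply.choices[0].message.content.strip()
    return guess.startswith(str(answer_index + 1))
```

Over many excerpts, accuracy well above chance suggests the model has seen the verbatim text before, rather than merely knowing the topic.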
Testing OpenAI Models
The authors – O’Reilly, Strauss, and AI researcher Sruly Rosenblat – evaluated the knowledge of GPT-4o, GPT-3.5 Turbo, and other OpenAI models regarding O’Reilly Media books published both before and after their respective training cutoff dates.
They analyzed 13,962 paragraph excerpts from 34 O’Reilly books to estimate the probability that a specific excerpt was included in a model’s training dataset.
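A simplified way to aggregate such per-excerpt trials into a per-book signal might look like the sketch below. It assumes four answer options per quiz (so chance accuracy is 25%) and reuses the hypothetical decop_trial outcomes from above; the published study instead reports AUROC scores and applies additional controls.

```python
from statistics import mean

CHANCE_ACCURACY = 0.25  # one verbatim passage among four options

def recognition_rate(outcomes: list[bool]) -> float:
    """Fraction of excerpts on which the model picked the verbatim passage."""
    return mean(outcomes)

# Hypothetical usage: outcomes_by_book maps a book title to the per-excerpt
# results of decop_trial() from the sketch above.
outcomes_by_book = {
    "Some Paywalled O'Reilly Title": [True, False, True, True],
}
for title, outcomes in outcomes_by_book.items():
    rate = recognition_rate(outcomes)
    print(f"{title}: {rate:.0%} recognized vs. {CHANCE_ACCURACY:.0%} chance")
```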
Results and Implications
The study’s findings indicate that GPT-4o “recognized” a significantly larger amount of paywalled O’Reilly book content compared to OpenAI’s older models, specifically GPT-3.5 Turbo. This observation remained consistent even after accounting for potential confounding variables, such as improvements in newer models’ ability to identify human-authored text.
“GPT-4o [likely] recognizes, and therefore possesses prior knowledge of, numerous non-public O’Reilly books published prior to its training cutoff date,” the authors concluded.
Caveats and Further Research
The authors emphasize that their findings do not constitute definitive proof. They acknowledge the limitations of their experimental methodology and the possibility that OpenAI may have acquired the paywalled book excerpts through user input – such as copying and pasting into ChatGPT.
Furthermore, the study did not evaluate OpenAI’s most recent models, including GPT-4.5 and “reasoning” models like o3-mini and o1. It remains possible that these models were not trained on paywalled O’Reilly book data, or were trained on a smaller quantity.
OpenAI's Data Acquisition Strategies
It is well-known that OpenAI, a proponent of more flexible regulations regarding the use of copyrighted data for model development, has been actively seeking higher-quality training data. The company has even hired journalists to refine the outputs of its models.
This trend is widespread within the industry, with AI companies increasingly recruiting experts in various fields – such as science and physics – to effectively integrate their knowledge into AI systems.
Existing Licensing Agreements
It is important to note that OpenAI does compensate for at least some of its training data. The company maintains licensing agreements with news publishers, social networks, stock media libraries, and other entities.
OpenAI also provides opt-out mechanisms – although imperfect – allowing copyright holders to request that their content not be used for training purposes.
Legal Challenges and Ongoing Scrutiny
With OpenAI facing multiple lawsuits in U.S. courts over its training-data practices and its interpretation of copyright law, the O’Reilly paper does not cast the company in a favorable light.
OpenAI did not respond to a request for comment.