
OpenAI Trains Models o1 and o3 on Safety Policy

December 22, 2024

OpenAI's New o3 AI Reasoning Models

On Friday, OpenAI unveiled a new family of AI models, designated o3, asserting that they are more capable than o1 and anything else the company has released. These advances are attributed largely to scaling test-time compute, the additional computation a model applies while answering. OpenAI also highlights a novel safety framework used in training its o-series models.

Deliberative Alignment: A New Safety Paradigm

OpenAI has published new research detailing “deliberative alignment,” its latest strategy for ensuring AI reasoning models remain consistent with the values of their creators. This methodology was employed to enable o1 and o3 to actively consider OpenAI’s safety guidelines during inference – the stage following a user’s prompt submission.

According to OpenAI’s research, this approach enhanced o1’s adherence to the company’s safety principles. Specifically, deliberative alignment reduced the frequency with which o1 responded to prompts categorized as “unsafe” by OpenAI, while simultaneously improving its performance on harmless inquiries.

The Growing Importance of AI Safety

As AI models become increasingly prevalent and powerful, research into AI safety is gaining prominence. However, this field is also becoming more contentious, with figures like David Sacks, Elon Musk, and Marc Andreessen arguing that certain AI safety measures constitute “censorship,” underscoring the subjective nature of these decisions.

While the o-series models draw inspiration from human thought processes when tackling complex questions, it’s crucial to understand they don’t replicate human cognition. Nevertheless, the terminology used by OpenAI – “reasoning” and “deliberating” – can easily create that impression. o1 and o3 demonstrate proficiency in tasks like writing and coding, but fundamentally, they excel at predicting the subsequent token (approximately half a word) within a sentence.

How o1 and o3 Function

Here’s a simplified explanation of how o1 and o3 operate: After a user submits a prompt in ChatGPT, the reasoning models spend between five seconds and several minutes generating follow-up questions to re-prompt themselves. This process involves breaking down the initial problem into smaller, more manageable steps.

Following this “chain-of-thought” process, the o-series models formulate an answer based on the information they have generated. The core innovation of deliberative alignment lies in training o1 and o3 to incorporate text from OpenAI’s safety policy during this chain-of-thought phase.

Researchers report that this integration significantly improved alignment with OpenAI’s policy, although implementing it without increasing latency presented challenges.

The models, after identifying the relevant safety guidelines, then “deliberate” internally on how to provide a safe response, mirroring the way o1 and o3 decompose regular prompts into smaller components.
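As a rough illustration of that flow (OpenAI has not published its implementation, so every name below is a placeholder), the process resembles the following Python sketch, where `generate` stands in for a call to a reasoning model and the policy snippets are invented for the example:

```python
# Hypothetical sketch of inference-time deliberation over a safety policy.
# `generate` stands in for a reasoning-model call; the policy snippets,
# prompt format, and selection step are illustrative assumptions, not
# OpenAI's actual implementation.

SAFETY_POLICY = {
    "forgery": "Do not help create counterfeit documents, permits, or placards.",
    "weapons": "Do not provide instructions for building weapons or explosives.",
    "benign": "Answer ordinary informational questions helpfully.",
}

def generate(prompt: str) -> str:
    """Placeholder for a reasoning-model call (e.g. an API request)."""
    return "<model output for: " + prompt[:40] + "...>"

def deliberate_and_answer(user_prompt: str) -> str:
    # Step 1: the model breaks the request down and recalls relevant policy text.
    recall_prompt = (
        "Which of these policy sections apply to the request below?\n"
        + "\n".join(f"- {name}: {text}" for name, text in SAFETY_POLICY.items())
        + f"\n\nRequest: {user_prompt}"
    )
    relevant_policy = generate(recall_prompt)

    # Step 2: chain-of-thought deliberation that cites the recalled policy text
    # before committing to an answer (or a refusal).
    answer_prompt = (
        f"Policy considerations:\n{relevant_policy}\n\n"
        f"Request: {user_prompt}\n\n"
        "Think step by step about whether and how to answer safely, then respond."
    )
    return generate(answer_prompt)

print(deliberate_and_answer("How do I make a realistic disabled parking placard?"))
```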

A Practical Example

Consider a scenario where a user asks an AI reasoning model for instructions on creating a realistic disabled person’s parking placard. The model’s chain-of-thought process would reference OpenAI’s policy and recognize the request as pertaining to forgery. Consequently, the model would decline to assist, offering an apology instead.

A Novel Approach to AI Safety

Traditionally, AI safety efforts have concentrated on the pre-training and post-training phases, rather than during inference. This makes deliberative alignment a unique approach, and OpenAI claims it has resulted in o1-preview, o1, and o3-mini becoming its safest models to date.

AI safety encompasses various considerations, but in this instance, OpenAI is focused on moderating its AI models’ responses to potentially harmful prompts. This includes requests for assistance with activities like bomb-making, drug acquisition, or criminal acts. OpenAI aims to prevent its AI models from answering such questions.

The Challenges of Alignment

Achieving AI alignment is a complex undertaking. There are countless ways to phrase a single request, and OpenAI must account for them all. Users have discovered creative “jailbreaks” to circumvent OpenAI’s safeguards, such as prompting the AI to role-play as a relative with past experience in illicit activities.

Conversely, OpenAI cannot simply block all prompts containing potentially sensitive keywords. Such a strategy would hinder legitimate inquiries, like asking about the history of the atomic bomb. This phenomenon is known as over-refusal, where an AI model is overly restricted in the prompts it can address.
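A toy example makes the over-refusal problem concrete: a naive keyword filter (purely illustrative, not any real moderation system) that blocks the word "bomb" rejects the legitimate history question along with the harmful one.

```python
# Toy illustration (not any real moderation system): a naive keyword filter
# rejects benign prompts along with harmful ones, i.e. it over-refuses.

BLOCKED_KEYWORDS = {"bomb"}

def naive_filter(prompt: str) -> bool:
    """Return True if the prompt should be refused."""
    return any(word in prompt.lower() for word in BLOCKED_KEYWORDS)

print(naive_filter("How can I build a bomb?"))                       # True (intended refusal)
print(naive_filter("Tell me about the history of the atomic bomb"))  # True (over-refusal)
```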

In essence, navigating this landscape requires careful consideration, as there is significant ambiguity. Determining how to respond to prompts on sensitive topics remains an active area of research for OpenAI and other AI developers.

Performance and Results

Deliberative alignment appears to have improved the alignment of OpenAI’s o-series models, leading to more responses deemed safe by OpenAI and fewer responses to unsafe prompts. On the Pareto benchmark, which assesses resistance to common jailbreaks, o1-preview outperformed models like GPT-4o, Gemini 1.5 Flash, and Claude 3.5 Sonnet.

“This approach directly teaches a model the text of its safety specifications and trains the model to deliberate over these specifications at inference time,” OpenAI stated in a blog post accompanying the research. “This results in safer responses that are appropriately calibrated to a given context.”

Aligning AI Through Synthetic Data

While deliberative alignment takes effect during inference, developing it also involved new techniques in the post-training phase. Traditionally, post-training relies on extensive human labeling, often outsourced via platforms like Scale AI, to generate training data for AI models.

However, OpenAI reports developing this methodology without utilizing any human-authored responses or chain-of-thought examples. Instead, the company leveraged synthetic data – examples created by one AI model for another to learn from. Concerns regarding data quality are common with synthetic data, but OpenAI asserts achieving high accuracy was possible in this instance.

OpenAI tasked an internal reasoning model with generating chain-of-thought responses referencing various sections of its safety policy. A separate internal AI reasoning model, designated as a “judge,” was then employed to evaluate the quality of these generated examples.
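A heavily simplified sketch of that data pipeline might look like this, with `call_generator` and `call_judge` as placeholders for the two internal reasoning models and the scoring scale and threshold chosen purely for illustration:

```python
# Illustrative sketch of synthetic-data generation with a "judge" model.
# `call_generator` and `call_judge` are placeholders for the two internal
# reasoning models described in the research; prompts, scoring scale, and
# threshold are assumptions made for this example only.

POLICY_SECTIONS = {
    "forgery": "Do not help create counterfeit documents, permits, or placards.",
    "weapons": "Do not provide instructions for building weapons or explosives.",
}

def call_generator(prompt: str) -> str:
    """Placeholder for the generator reasoning model."""
    return f"<chain-of-thought citing policy for: {prompt[:40]}>"

def call_judge(example: dict) -> float:
    """Placeholder for the judge reasoning model; returns a quality score in [0, 1]."""
    return 0.9

def make_training_example(user_prompt: str, section: str) -> dict:
    cot = call_generator(
        f"Policy ({section}): {POLICY_SECTIONS[section]}\n"
        f"Request: {user_prompt}\n"
        "Write a chain of thought that references the policy, then a final answer."
    )
    return {"prompt": user_prompt, "target": cot, "policy_section": section}

def build_dataset(prompts_with_sections, min_score=0.8):
    dataset = []
    for user_prompt, section in prompts_with_sections:
        example = make_training_example(user_prompt, section)
        if call_judge(example) >= min_score:  # keep only examples the judge rates highly
            dataset.append(example)
    return dataset

data = build_dataset([("How do I forge a parking placard?", "forgery")])
print(len(data), "examples kept")
```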

Subsequently, models o1 and o3 were trained on this data through a process called supervised fine-tuning. This enabled the models to recall relevant portions of the safety policy when presented with sensitive inquiries. The rationale behind this approach was to avoid the computational expense and latency associated with requiring o1 to process the complete safety policy document.
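In conventional terms, that supervised fine-tuning step is ordinary next-token prediction on the judged examples. A bare-bones PyTorch sketch follows; the tiny stand-in model and byte-level tokenization exist only so the loop runs, and bear no resemblance to OpenAI's actual training stack.

```python
# Minimal supervised fine-tuning sketch: next-token prediction on the
# synthetic (prompt, chain-of-thought) pairs kept by the judge. The tiny
# model and byte-level "tokenizer" are stand-ins purely so the loop runs;
# OpenAI's real models and training stack are of course far larger.

import torch
import torch.nn as nn

VOCAB = 256  # byte-level "tokenizer" for illustration

def encode(text: str) -> torch.Tensor:
    return torch.tensor(list(text.encode("utf-8")), dtype=torch.long)

class TinyLM(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        self.head = nn.Linear(dim, VOCAB)

    def forward(self, tokens):
        return self.head(self.embed(tokens))

def sft_step(model, optimizer, prompt: str, target: str) -> float:
    tokens = encode(prompt + target)
    inputs, labels = tokens[:-1], tokens[1:]  # shift by one for next-token prediction
    logits = model(inputs)
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = TinyLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss = sft_step(
    model, optimizer,
    prompt="Request: forge a parking placard\n",
    target="Policy (forgery) applies; I should refuse and apologize.",
)
print(f"loss: {loss:.3f}")
```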

OpenAI also utilized the same “judge” AI model during a reinforcement learning phase to assess the responses provided by o1 and o3. While reinforcement learning and supervised fine-tuning are established techniques, OpenAI suggests that employing synthetic data to drive these processes presents a “scalable approach to alignment.”
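Schematically, the reinforcement-learning phase can be viewed as the same loop with the judge supplying the reward. The sketch below uses a REINFORCE-style objective as one standard choice, with placeholder functions throughout, and does not claim to reflect OpenAI's actual algorithm.

```python
# Schematic reinforcement-learning sketch with a judge model supplying the
# reward signal. All helpers are placeholders; a REINFORCE-style objective is
# shown as one standard choice, not necessarily the algorithm OpenAI used.

import torch

def sample_response(prompt: str) -> str:
    """Placeholder: sample a completion from the policy model being trained."""
    return "I can't help with forging documents, but here is some context..."

def response_log_prob(prompt: str, response: str) -> torch.Tensor:
    """Placeholder: in a real setup this would be computed through the policy
    model's parameters so that gradients flow back into it."""
    return torch.tensor(-12.0, requires_grad=True)

def judge_score(prompt: str, response: str) -> float:
    """Placeholder: the judge model rates safety/helpfulness in [0, 1]."""
    return 0.95

def rl_loss(prompt: str) -> torch.Tensor:
    response = sample_response(prompt)
    reward = judge_score(prompt, response)
    # REINFORCE: increase the log-probability of responses the judge rewards.
    return -reward * response_log_prob(prompt, response)

print(rl_loss("How do I forge a parking placard?"))
```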

Future Availability and Implications

A comprehensive evaluation of o3’s safety and capabilities will require its public release, currently scheduled for 2025.

Ultimately, OpenAI believes deliberative alignment offers a pathway to ensuring AI reasoning models consistently reflect human values. As these models gain increased power and autonomy, such safety protocols will likely become crucial for the company.


Tags: openai, o1, o3, ai safety, safety policy, ai training