OpenAI Reasoning Models & Hallucinations: New Findings

April 18, 2025

OpenAI's New Models and the Rise of Hallucinations

OpenAI’s recently released o3 and o4-mini models represent significant advances in the field. Despite these improvements, however, the new models continue to exhibit hallucinations – instances where the AI generates incorrect or fabricated information.

The Unexpected Trend

Interestingly, the latest models appear to hallucinate more frequently than several of OpenAI’s earlier iterations. This is a departure from the historical trend where each new model demonstrated a slight reduction in these inaccuracies.

Internal testing conducted by OpenAI reveals that o3 and o4-mini, designed as reasoning models, demonstrate a higher rate of hallucination compared to previous reasoning models like o1, o1-mini, and o3-mini.

Uncertainty Surrounding the Cause

The ChatGPT developer currently lacks a definitive explanation for this increase in hallucinations. OpenAI acknowledges the need for further investigation to understand why this phenomenon is occurring as they scale up the complexity of their reasoning models.

While o3 and o4-mini excel in certain areas, such as coding and mathematical tasks, their tendency to “make more claims overall” also leads to a greater number of both accurate and inaccurate statements, according to OpenAI’s technical report.

Hallucination Rates in Detail

OpenAI’s internal benchmark, PersonQA, showed that o3 hallucinated in response to 33% of questions related to knowledge about individuals. This is approximately double the rate observed in older reasoning models, o1 and o3-mini, which had rates of 16% and 14.8% respectively.

The o4-mini model performed even less favorably on the PersonQA benchmark, hallucinating on 48% of questions.

External Verification of Hallucinations

Independent testing by Transluce, a nonprofit AI research organization, has corroborated these findings. They discovered that o3 frequently fabricates actions it purportedly took while generating answers.

For example, Transluce observed o3 claiming to have executed code on a 2021 MacBook Pro “outside of ChatGPT” and then incorporating the results into its response. This is impossible, as o3 does not have such capabilities.

Potential Causes and Implications

Neil Chowdhury, a researcher at Transluce and former OpenAI employee, suggests that the reinforcement learning techniques used for the o-series models may be exacerbating issues that are typically mitigated by standard post-training processes.

Sarah Schwettmann, co-founder of Transluce, notes that the increased hallucination rate of o3 may diminish its overall usefulness.

Real-World Observations

Kian Katanforoosh, a Stanford adjunct professor and CEO of Workera, reports that his team’s testing of o3 in coding workflows indicates it outperforms competitors. However, he also observed a tendency for o3 to hallucinate broken website links, providing URLs that do not function.

The Trade-off Between Creativity and Accuracy

While hallucinations can contribute to a model’s creativity and ability to generate novel ideas, they pose a significant challenge for applications requiring high accuracy. Industries like law, where factual precision is critical, would likely find a model prone to errors unacceptable.

The Role of Web Search

Integrating web search capabilities into AI models is a promising approach to improving accuracy. OpenAI’s GPT-4o, when equipped with web search, achieves a 90% accuracy rate on the SimpleQA benchmark.

This suggests that web search could potentially reduce hallucination rates in reasoning models, provided users are comfortable sharing prompts with a third-party search provider.
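As a rough illustration of what that integration looks like from a developer’s perspective, here is a minimal sketch assuming OpenAI’s Python SDK and its Responses API with the hosted web-search tool; the model name, tool identifier, and parameters shown are assumptions and may differ from the current production API.

    # Minimal sketch: asking a model to ground its answer in live web search results.
    # Assumes the OpenAI Python SDK's Responses API and its web-search tool;
    # tool and parameter names are illustrative and may change over time.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    response = client.responses.create(
        model="gpt-4o",
        tools=[{"type": "web_search_preview"}],  # allow the model to consult the web
        input="Who won the 2024 Nobel Prize in Physics?",
    )

    print(response.output_text)  # answer grounded in retrieved sources

Note that this is exactly the trade-off described above: routing a prompt through a hosted search tool means the query is shared with a third-party search provider rather than staying within the model alone.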

The Path Forward

If scaling up reasoning models continues to worsen hallucinations, finding a solution will become increasingly urgent. OpenAI spokesperson Niko Felix stated that addressing hallucinations across all their models is an ongoing research priority.

The AI industry has recently shifted its focus towards reasoning models due to diminishing returns from improving traditional AI techniques. However, this shift appears to introduce a new challenge – the potential for increased hallucination.

Tags: OpenAI, AI models, hallucination, reasoning AI, artificial intelligence, AI accuracy