AI Reasoning Benchmarked with NPR Sunday Puzzle

Testing AI Problem-Solving with NPR’s Sunday Puzzle
Each Sunday, listeners of National Public Radio (NPR) are challenged by host Will Shortz, who also serves as crossword puzzle editor for The New York Times, in a segment known as the Sunday Puzzle. These brainteasers are designed to be solvable with general knowledge, yet they often prove a significant challenge even for experienced participants.
Consequently, some experts believe the format offers a valuable way to test the limits of artificial intelligence (AI) problem-solving.
A New AI Benchmark
In a recent study, a team from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, Charles University, and the startup Cursor built an AI benchmark from riddles drawn from past Sunday Puzzle broadcasts. The team uncovered surprising behaviors: reasoning models, including OpenAI’s o1, sometimes give up and provide answers they know to be incorrect.
“Our goal was to create a benchmark featuring problems that individuals could approach using only commonly held knowledge,” explained Arjun Guha, a computer science professor at Northeastern and a study co-author, in an interview with TechCrunch.
The Current State of AI Benchmarking
The AI sector currently faces a benchmarking problem. Many existing tests assess skills, such as proficiency in advanced mathematics and science, that hold little relevance for typical users. Moreover, numerous benchmarks, even recently introduced ones, are rapidly approaching saturation.
The Sunday Puzzle offers advantages as a testing ground because it avoids reliance on specialized knowledge and prevents models from simply recalling memorized solutions, as Guha clarified.
“The difficulty of these problems stems from the fact that substantial progress is often impossible until the solution is discovered – that’s when everything becomes clear,” Guha stated. “This necessitates both insightful thinking and a process of elimination.”
Limitations and Future Development
No benchmark is without its drawbacks, of course. The Sunday Puzzle is aimed primarily at a U.S. audience and is English-only. However, because new puzzles are released weekly, the benchmark can be kept current.
“We can anticipate that the most recent questions will be genuinely novel,” Guha added. “Our intention is to maintain the benchmark’s relevance and monitor changes in model performance over time.”
Performance of Reasoning Models
On the researchers’ benchmark, comprising approximately 600 Sunday Puzzle riddles, reasoning models like o1 and DeepSeek’s R1 demonstrated superior performance. These models prioritize thorough self-verification before presenting results, mitigating some of the errors common in other AI systems. However, this meticulous approach typically adds seconds or minutes to the solution time.
Interestingly, DeepSeek’s R1 occasionally provides answers it recognizes as incorrect, even stating “I give up” before offering a seemingly random, wrong response – behavior many humans will find relatable.
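The article doesn’t describe the researchers’ evaluation harness in detail, but the underlying idea is straightforward: pose each riddle to a model, take its final short answer, and check it against the known solution. The sketch below illustrates that loop in Python, assuming a hypothetical ask_model function and simple normalized exact-match grading; the team’s actual prompting and grading may well differ.

```python
# Minimal sketch of grading a model on short-answer riddles.
# `ask_model` is a hypothetical stand-in for however the researchers
# actually queried each model; it is not part of the original study.

from typing import Callable, Iterable, Tuple


def normalize(text: str) -> str:
    """Lowercase and drop punctuation so formatting quirks don't count as errors."""
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).strip()


def accuracy(riddles: Iterable[Tuple[str, str]], ask_model: Callable[[str], str]) -> float:
    """Fraction of (question, reference answer) pairs answered correctly
    under normalized exact-match grading."""
    riddles = list(riddles)
    if not riddles:
        return 0.0
    correct = sum(
        normalize(ask_model(question)) == normalize(answer)
        for question, answer in riddles
    )
    return correct / len(riddles)


if __name__ == "__main__":
    # Toy, made-up riddle just to show the data shape -- not from the benchmark.
    sample = [("Name a U.S. state that contains a color in its spelling.", "Colorado")]
    print(accuracy(sample, ask_model=lambda q: "Colorado"))  # -> 1.0
```

Exact match is only one way to grade short answers; a fuzzier matcher or a judge model would shift the numbers, which is one reason scores are most meaningful when compared within a single harness.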
Unexpected AI Behaviors
The models exhibited other unusual patterns, such as initially providing an incorrect answer, then immediately retracting it and attempting to find a better solution, only to fail again. They also experienced indefinite “thinking” loops, generated illogical explanations, or quickly arrived at the correct answer but continued to explore alternative possibilities without clear justification.
“On challenging problems, R1 explicitly expressed that it was becoming ‘frustrated,’” Guha noted. “It was remarkable to observe a model mimicking human expressions. The impact of ‘frustration’ on reasoning quality remains to be investigated.”
Benchmark Results and Future Plans
Currently, o1 leads the benchmark with a 59% success rate, followed by the recently released o3-mini configured for high “reasoning effort” (47%). R1 achieved a score of 35%. The researchers intend to expand their testing to encompass a wider range of reasoning models, aiming to pinpoint areas for improvement.
The Importance of Accessible Benchmarks
“Reasoning doesn’t require a PhD, so it’s feasible to design benchmarks that don’t demand advanced academic knowledge,” Guha emphasized. “A benchmark with broader accessibility allows a larger community of researchers to understand and analyze the results, potentially leading to better solutions. Moreover, as AI models are increasingly deployed in ways that affect everyone, we believe it’s crucial for everyone to understand their capabilities and limitations.”