
OpenAI says GPT-5 stacks up to humans in a wide range of jobs

September 25, 2025

OpenAI's New AI Performance Benchmark: GDPval

On Thursday, OpenAI unveiled a new benchmark designed to evaluate how its AI models perform compared to human professionals across a range of sectors and occupations. The test, known as GDPval, is an initial effort to gauge how close OpenAI's systems are to matching or surpassing humans at economically significant work, a goal that aligns with the company's core objective of developing artificial general intelligence (AGI).

GPT-5 and Claude Opus 4.1 Performance

OpenAI reports that its GPT-5 model, along with Anthropic's Claude Opus 4.1, is producing work that approaches the quality of established industry experts.

It’s important to note that widespread job displacement due to AI is not imminent. While some predictions suggest AI could replace human workers within a few years, OpenAI clarifies that the current GDPval assessment covers a limited scope of tasks encountered in real-world professions. Nevertheless, it serves as a crucial metric for tracking AI’s advancement toward this pivotal milestone.

GDPval Methodology and Industry Focus

GDPval is structured around nine industries that collectively contribute the most to the United States’ gross domestic product. These include key areas such as healthcare, finance, manufacturing, and government.

The benchmark assesses AI performance across 44 distinct occupations within these industries, encompassing roles ranging from software engineers to nurses and journalists.

Evaluation Process

For the initial iteration of the test, GDPval-v0, OpenAI enlisted experienced professionals to compare AI-generated reports with those created by their peers. Participants were then asked to identify the superior report.

As an example, investment bankers were tasked with developing a competitive analysis of the last-mile delivery sector, subsequently comparing their work to reports generated by AI. OpenAI then calculates an AI model’s “win rate” against human-authored reports across all 44 occupations.
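The aggregation described above can be sketched in a few lines. This is a minimal illustration, not OpenAI's actual methodology: it assumes each pairwise comparison yields a simple "win", "tie", or "loss" verdict for the AI-generated report, and that the per-occupation rates are averaged with equal weight (the real weighting scheme is not described in the article).

```python
from collections import Counter

def win_rate(verdicts):
    """Fraction of pairwise comparisons in which the AI report was
    judged better than ("win") or on par with ("tie") the human one.

    verdicts: list of strings, each "win", "tie", or "loss".
    """
    counts = Counter(verdicts)
    total = sum(counts.values())
    return (counts["win"] + counts["tie"]) / total if total else 0.0

def overall_win_rate(per_occupation):
    """Unweighted average of per-occupation win rates.

    per_occupation: dict mapping occupation name -> list of verdicts.
    """
    rates = [win_rate(v) for v in per_occupation.values()]
    return sum(rates) / len(rates) if rates else 0.0

# Hypothetical verdicts for two of the 44 occupations:
graded = {
    "investment banker": ["win", "tie", "loss", "loss"],
    "journalist": ["loss", "tie", "loss", "win"],
}
print(overall_win_rate(graded))
```

Under this reading, a reported figure like 40.6% means the model's reports were graded better-or-equal in roughly four out of ten expert comparisons.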

GPT-5-High Results

According to OpenAI, a high-performance version of GPT-5, designated GPT-5-high, was rated as better than or on par with industry experts in 40.6% of evaluations.

Anthropic’s Claude Opus 4.1 model was rated as better than or on par with industry experts in 49% of tasks. OpenAI suggests that Claude’s higher score may be partially attributable to its knack for producing visually appealing graphics, rather than purely the substance of its work.

Limitations and Future Development

It is crucial to recognize that the majority of professionals engage in tasks beyond simply submitting research reports, which is the sole focus of GDPval-v0. OpenAI acknowledges this limitation and intends to develop more comprehensive tests in the future.

These future tests will aim to encompass a broader range of industries and incorporate interactive workflows for a more realistic assessment.

Potential Impact on the Workforce

OpenAI views the progress demonstrated on GDPval as significant. Dr. Aaron Chatterji, OpenAI’s chief economist, stated in an interview with TechCrunch that the results indicate that professionals in these fields can now leverage AI models to dedicate more time to higher-value activities.

“As the model’s capabilities improve,” Chatterji explains, “individuals in these roles can increasingly utilize it to offload certain tasks and concentrate on more impactful work.”

Rate of Progress

Tejal Patwardhan, an evaluator at OpenAI, expressed encouragement regarding the pace of advancement on GDPval. OpenAI’s GPT-4o model, released approximately 15 months ago, achieved a score of 13.7% (wins and ties versus humans). GPT-5 now achieves nearly triple that score, a trend Patwardhan anticipates will continue.

Context within AI Benchmarking

Silicon Valley employs a variety of benchmarks to measure the progress of AI models and determine state-of-the-art performance. Popular examples include AIME 2025 (testing competitive math skills) and GPQA Diamond (assessing PhD-level science knowledge).

However, several AI models are approaching saturation on these benchmarks, leading many AI researchers to advocate for improved tests that can accurately measure AI’s proficiency in real-world scenarios.

The Future of AI Evaluation

Benchmarks like GDPval may become increasingly important in this discussion, as OpenAI argues for the practical value of its AI models across various industries. However, a more extensive version of the test may be necessary to definitively demonstrate that AI models can consistently outperform humans.

#GPT-5 #OpenAI #AI #artificial intelligence #large language model #LLM