
OpenAI Upgrades Transcription & Voice AI Models - Latest News

March 20, 2025

OpenAI Enhances API with Advanced Transcription and Voice Generation Models

OpenAI has announced the release of updated AI models for both transcription and voice generation through its API. These new models represent a significant improvement over their predecessors, according to the company.

Advancing the "Agentic" Vision

These developments align with OpenAI’s overarching goal of creating “agentic” systems. These are automated systems designed to independently complete tasks for users. Olivier Godement, Head of Product at OpenAI, envisions these agents as sophisticated chatbots capable of handling customer interactions for businesses.

“We anticipate a growing number of agents emerging in the coming months,” Godement stated to TechCrunch. “Our focus is on providing customers and developers with the tools to build agents that are both effective and reliable.”

New Text-to-Speech Capabilities: gpt-4o-mini-tts

The new text-to-speech model, gpt-4o-mini-tts, delivers more natural and expressive speech than previous iterations. It also offers enhanced “steerability,” allowing developers greater control over the vocal characteristics.

Developers can now instruct the model to adopt specific speaking styles using natural language commands. For example, they can request a voice resembling “a mad scientist” or one that embodies “the serenity of a mindfulness teacher.”
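As a sketch of how a developer might pass such a style instruction, the snippet below assembles a request for the speech endpoint of OpenAI's Python SDK. The voice preset, output filename, and prompt text are illustrative assumptions, not from the announcement, and the network call only runs when an API key is configured:

```python
import os

# Request parameters for the /v1/audio/speech endpoint. The
# `instructions` field carries the natural-language style prompt
# described above; the other values here are illustrative.
payload = {
    "model": "gpt-4o-mini-tts",
    "voice": "coral",  # assumed built-in voice preset
    "input": "Thanks for calling. How can I help you today?",
    "instructions": "Speak with the serenity of a mindfulness teacher.",
}

if os.environ.get("OPENAI_API_KEY"):
    # Only attempt the API call when a key is available.
    from openai import OpenAI

    client = OpenAI()
    with client.audio.speech.with_streaming_response.create(**payload) as resp:
        resp.stream_to_file("greeting.mp3")  # hypothetical output path
```

Swapping the `instructions` string, e.g. to "a mad scientist", changes the delivery without changing the spoken content.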

OpenAI shared audio samples alongside the announcement, including a voice styled to sound like a weathered true crime narrator and a sample of a professional female voice.

Jeff Harris, a product team member at OpenAI, explained that the aim is to enable developers to customize both the overall voice “experience” and the specific “context.”

“A monotonous voice isn’t suitable for all situations,” Harris noted. “In customer support, for instance, the voice can convey apology when an error occurs. We believe developers and users want precise control over not only the content of speech, but also its delivery.”

Improved Speech-to-Text: gpt-4o-transcribe and gpt-4o-mini-transcribe

OpenAI’s new speech-to-text models, gpt-4o-transcribe and gpt-4o-mini-transcribe, are designed to supersede the older Whisper transcription model. These models were trained on extensive and diverse audio datasets.

OpenAI asserts that the new models demonstrate improved accuracy in capturing accented speech and handling noisy environments. They are also less prone to generating inaccurate or fabricated content.
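Because the new models sit behind the same transcription endpoint as Whisper, migrating is largely a matter of swapping the model name. The sketch below assumes the existing OpenAI Python SDK interface; the audio filename and language hint are illustrative, and the call only runs when an API key is set:

```python
import os

# Parameters for the /v1/audio/transcriptions endpoint; the model
# name is the main change from a "whisper-1" call.
params = {
    "model": "gpt-4o-transcribe",  # or "gpt-4o-mini-transcribe" for lower cost
    "language": "en",              # optional hint for the input language
}

if os.environ.get("OPENAI_API_KEY"):
    # Only attempt the API call when a key is available.
    from openai import OpenAI

    client = OpenAI()
    with open("meeting.wav", "rb") as audio:  # hypothetical input file
        transcript = client.audio.transcriptions.create(file=audio, **params)
    print(transcript.text)
```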

Whisper was known to occasionally invent words or even entire phrases during transcription, sometimes introducing unintended and inappropriate content.

“These models represent a substantial improvement over Whisper in terms of accuracy,” Harris emphasized. “Reliable voice experiences depend on accurate transcription, meaning the models should accurately capture spoken words without adding extraneous details.”

However, transcription accuracy can vary depending on the language being processed.

Language-Specific Performance

Internal benchmarks from OpenAI indicate that gpt-4o-transcribe, the more accurate of the two transcription models, exhibits a "word error rate" of approximately 30% for Indic and Dravidian languages such as Tamil, Telugu, Malayalam, and Kannada. This means roughly three out of every ten words differ from a human-generated transcription in these languages.
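To make the cited metric concrete, word error rate is conventionally computed as the word-level edit distance between a hypothesis and a reference transcript, divided by the reference length. The illustrative implementation below is the standard textbook formulation, not OpenAI's benchmark code:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Three substituted words out of ten gives the 30% rate cited above.
ref = "one two three four five six seven eight nine ten"
hyp = "one two three four five six seven x y z"
print(word_error_rate(ref, hyp))  # → 0.3
```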

Shift in Availability: No Open-Source Release

In a departure from past practice, OpenAI will not be making its new transcription models publicly available. Previously, new versions of Whisper were released under an MIT license for commercial use.

Harris explained that gpt-4o-transcribe and gpt-4o-mini-transcribe are significantly larger and more complex than Whisper, making an open release impractical.

“These models are not designed to run locally on standard laptops, unlike Whisper,” he clarified. “We are committed to thoughtful open-source releases, focusing on models specifically tailored for those environments. We believe end-user devices represent a particularly compelling use case for open-source models.”

Updated March 20, 2025, 11:54 a.m. PT to clarify the language surrounding word error rate and to reflect updated benchmark results.

#OpenAI #AI models #transcription #voice generation #artificial intelligence #AI upgrade