ElevenLabs Launches Speech-to-Text Model

ElevenLabs Enters Speech-to-Text Arena with Scribe

ElevenLabs, a fast-growing AI company that recently raised $180 million, is broadening its focus. Known until now for its audio generation tools, the company has released its first standalone speech-to-text model, named Scribe.

Expanding Beyond Text-to-Speech

Valued at $3.3 billion, ElevenLabs has established itself by providing text-to-speech services to numerous organizations, drawing on its extensive voice library. Now the company aims to compete in speech recognition, challenging established players such as Gladia, Speechmatics, AssemblyAI, Deepgram, and OpenAI’s Whisper.

Multilingual Support and Accuracy

At launch, ElevenLabs’ Scribe model supports over 99 languages. The company groups these languages by accuracy, with more than 25 of them achieving a word error rate (WER) below 5%.

This high-accuracy group includes English (with a reported 97% accuracy rate), French, German, Hindi, Indonesian, Japanese, Kannada, Malayalam, Polish, Portuguese, Spanish, and Vietnamese.
The remaining languages fall into three further tiers by word error rate: high accuracy (5%-10%), good (10%-25%), and moderate (25%-50%).
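
For readers unfamiliar with the metric, word error rate is conventionally computed as the word-level edit distance between a reference transcript and the model's output, divided by the number of words in the reference. The sketch below illustrates that conventional definition only. The article does not describe how ElevenLabs normalizes text or scores its benchmarks, and the function name word_error_rate is a hypothetical one chosen for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: simple lowercasing and whitespace tokenization. Real
    benchmarks usually apply more careful text normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one dropped word ("the"):
    # 2 errors over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

If accuracy is taken to mean one minus the word error rate, which is an assumption rather than something the article states, the 97% figure reported for English would correspond to a WER of about 3%.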

Performance Benchmarks

According to ElevenLabs, Scribe outperforms Google's Gemini 2.0 Flash and OpenAI's Whisper Large V3 in benchmark tests on the FLEURS and Common Voice datasets. The figures are the company's own, but if they hold up they would mark a notable step forward in speech-to-text accuracy.


Development and Future Plans

The speech-to-text component was originally developed for ElevenLabs’ conversational AI agent platform, which the company released earlier. Scribe is the first time the company has offered speech recognition to the public as a standalone model.

In a recent interview with TechCrunch, CEO Mati Staniszewski emphasized the importance of improving speech recognition. He said the company intends to move beyond content generation and focus on understanding and transcribing spoken language.

Advanced Features

Scribe includes speaker diarization to distinguish who is speaking, word-level timestamps for precise subtitles, and automatic tagging of sound events such as audience laughter. Users can also transcribe video content directly within the ElevenLabs studio to add subtitles or captions.
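
To show why word-level timestamps matter for subtitles, here is a minimal sketch that groups timed words into SubRip (SRT) cues. The input shape, a list of dictionaries with "text", "start", and "end" keys measured in seconds, is an assumption made for illustration and is not taken from ElevenLabs' API. The helper names format_srt_time and words_to_srt are likewise hypothetical.

```python
from typing import Dict, List, Union

Word = Dict[str, Union[str, float]]


def format_srt_time(seconds: float) -> str:
    """Render seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Word], max_words: int = 7) -> str:
    """Group consecutive words into cues of at most max_words each.

    Each cue starts at its first word's start time and ends at its
    last word's end time.
    """
    cues = []
    for index in range(0, len(words), max_words):
        chunk = words[index:index + max_words]
        start = format_srt_time(float(chunk[0]["start"]))
        end = format_srt_time(float(chunk[-1]["end"]))
        text = " ".join(str(word["text"]) for word in chunk)
        number = index // max_words + 1
        cues.append(f"{number}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


if __name__ == "__main__":
    # Hypothetical transcript fragment with per-word timings.
    sample = [
        {"text": "Hello", "start": 0.0, "end": 0.5},
        {"text": "world", "start": 0.5, "end": 1.0},
    ]
    print(words_to_srt(sample, max_words=2))
```

Splitting on a fixed word count keeps the sketch short. A production captioning tool would more likely break cues at pauses, punctuation, or speaker changes.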

Current Limitations and Future Updates

For now, Scribe only processes pre-recorded audio files. ElevenLabs plans to release a low-latency, real-time version of the model in the near future, which would enable applications such as live meeting transcription and voice note capture.

Pricing Information

ElevenLabs is offering Scribe at $0.40 per hour of transcribed audio. The price is competitive, though some rivals currently charge less for transcription, possibly with different feature sets.
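
As a rough illustration of what the quoted rate means in practice, the sketch below estimates the cost of transcribing a recording of a given length. It assumes billing is pro-rated by the second and rounded to the nearest cent, neither of which the article confirms. The function name estimate_cost_usd is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

# Rate quoted in the article: $0.40 per hour of transcribed audio.
RATE_PER_HOUR_USD = Decimal("0.40")


def estimate_cost_usd(
    duration_seconds: int,
    rate_per_hour: Decimal = RATE_PER_HOUR_USD,
) -> Decimal:
    """Estimate transcription cost, rounded to the nearest cent.

    Assumption: cost scales linearly with duration (per-second
    pro-rating). Actual billing rules may differ.
    """
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    hours = Decimal(duration_seconds) / Decimal(3600)
    return (hours * rate_per_hour).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )


if __name__ == "__main__":
    # A 90-minute recording is 1.5 hours of audio.
    print(estimate_cost_usd(90 * 60))
```

Decimal is used instead of float so that currency amounts round predictably.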
