ElevenLabs Launches Speech-to-Text Model | AI Voice Technology

February 26, 2025

ElevenLabs Enters Speech-to-Text Arena with Scribe

ElevenLabs, a fast-growing AI company that recently raised $180 million in funding, is expanding its technological focus. Best known for its audio generation technology, the company has now introduced its first standalone speech-to-text model, named Scribe.

Expanding Beyond Text-to-Speech

Valued at $3.3 billion, ElevenLabs has already established itself by providing text-to-speech services to numerous organizations, leveraging its extensive voice library. Now, the company aims to compete in speech recognition, challenging established players such as Gladia, Speechmatics, AssemblyAI, Deepgram, and OpenAI’s Whisper.

Multilingual Support and Accuracy

At launch, ElevenLabs’ Scribe model offers support for over 99 languages. The company has categorized languages based on accuracy, with over 25 achieving a word error rate of under 5%.

  • This top tier (under 5% word error rate) includes English (with a reported 97% accuracy rate), French, German, Hindi, Indonesian, Japanese, Kannada, Malayalam, Polish, Portuguese, Spanish, and Vietnamese.
  • Other languages are classified by word error rate: high accuracy (5%-10%), good accuracy (10%-25%), and moderate accuracy (25%-50%).
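
For readers unfamiliar with the metric, word error rate is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model's output, divided by the number of words in the reference. The sketch below assumes that standard definition and naive lowercase, whitespace-based tokenization. The article does not say how ElevenLabs normalizes text for its own figures, and the function name is illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: lowercase + whitespace tokenization. Real benchmarks usually
    apply more careful normalization (punctuation, numerals, and so on).
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {wer:.1%}")  # one substitution + one deletion over six words
```

If "accuracy" is read as one minus the word error rate, which is an assumption rather than something the article states, a 97% accuracy rate corresponds to roughly three errors per hundred reference words.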

Performance Benchmarks

According to ElevenLabs, Scribe outperformed Google Gemini 2.0 Flash and Whisper Large V3 in benchmark tests on the FLEURS and Common Voice datasets. If the company's figures hold up, they represent a notable step forward in speech-to-text accuracy.

Development and Future Plans

The speech-to-text component was initially developed for ElevenLabs’ previously released conversational AI agent platform. Scribe is the first time the company has offered a standalone speech-to-text model to the public.

CEO Mati Staniszewski, in a recent interview with TechCrunch, emphasized the importance of improving speech recognition. He said the company intends to move beyond content generation and focus on understanding and transcribing spoken language.

Advanced Features

Scribe incorporates several advanced features, including smart speaker diarization to identify speakers, word-level timestamps for precise subtitles, and automatic tagging of sound events, such as audience laughter. The platform also allows users to directly transcribe video content for adding subtitles or captions within the ElevenLabs studio.
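
To illustrate why word-level timestamps matter for subtitles, here is a minimal sketch that groups timed words into SubRip (SRT) cues. The input shape (dictionaries with `text`, `start`, and `end` in seconds) and the grouping thresholds are assumptions made for the example. The article does not describe Scribe's actual response format.

```python
from typing import Dict, List, Union

Word = Dict[str, Union[str, float]]  # hypothetical shape: text, start, end


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Word], max_words: int = 7, max_gap: float = 0.8) -> str:
    """Group words into cues, starting a new cue when the current one is
    full or when the silence before the next word exceeds max_gap seconds."""
    cues: List[List[Word]] = []
    current: List[Word] = []
    for word in words:
        too_long = len(current) >= max_words
        long_pause = bool(current) and word["start"] - current[-1]["end"] > max_gap
        if current and (too_long or long_pause):
            cues.append(current)
            current = []
        current.append(word)
    if current:
        cues.append(current)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        text = " ".join(str(w["text"]) for w in cue)
        span = f"{srt_time(cue[0]['start'])} --> {srt_time(cue[-1]['end'])}"
        blocks.append(f"{index}\n{span}\n{text}")
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    sample = [
        {"text": "Hello", "start": 0.0, "end": 0.4},
        {"text": "world", "start": 0.5, "end": 0.9},
        {"text": "Again", "start": 2.5, "end": 3.0},
    ]
    print(words_to_srt(sample))
```

Because each word carries its own start and end time, cue boundaries can follow natural pauses instead of fixed-length windows.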

Current Limitations and Future Updates

Currently, Scribe is designed to process pre-recorded audio files. A low-latency, real-time version of the model is planned for release in the near future. This will enable applications like live meeting transcriptions and voice note capture, which are not currently supported.

Pricing Information

ElevenLabs is offering Scribe at a rate of $0.40 per hour of transcribed audio. While this pricing is competitive, some competitors currently provide lower rates for audio transcriptions, albeit with potential differences in feature sets.
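
As a rough illustration of what the quoted rate means in practice, the sketch below estimates the cost of transcribing a batch of recordings at $0.40 per hour. It assumes usage is prorated by the second with no minimum charge or plan tiers, which the article does not confirm, and the function name is hypothetical.

```python
from typing import Iterable

RATE_USD_PER_HOUR = 0.40  # rate quoted in the article


def estimate_cost(durations_seconds: Iterable[float]) -> float:
    """Estimated cost in USD, assuming per-second proration (an assumption)."""
    total_hours = sum(durations_seconds) / 3600
    return round(total_hours * RATE_USD_PER_HOUR, 2)


if __name__ == "__main__":
    # A one-hour recording plus a 30-minute recording.
    print(estimate_cost([3600, 1800]))
```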
