ElevenLabs, an AI startup that recently raised a $180 million funding round, is best known for audio generation. The company is now moving into speech recognition with its first stand-alone speech-to-text model, Scribe.
The startup, valued at $3.3 billion, has already helped numerous companies with their text-to-speech services through its extensive library of voices. Now ElevenLabs is entering the competitive speech recognition market, going head-to-head with Gladia, Speechmatics, AssemblyAI, Deepgram, and OpenAI’s Whisper models.
The Scribe model supports over 99 languages at launch. ElevenLabs places more than 25 of them in the "excellent accuracy" category, where the word error rate is less than 5%. These languages include English (with a claimed accuracy rate of 97%), French, German, Hindi, Indonesian, Japanese, Kannada, Malayalam, Polish, Portuguese, Spanish, and Vietnamese. The remaining languages are grouped by word error rate: high (5% to 10%), good (10% to 25%), and moderate (25% to 50%).
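For readers unfamiliar with the metric, word error rate (WER) is conventionally the number of word-level substitutions, deletions, and insertions needed to turn a transcript into the reference text, divided by the number of words in the reference. The Python sketch below illustrates that standard definition; it is not ElevenLabs' evaluation code. The function names are hypothetical, and reading the claimed 97% accuracy as roughly a 3% WER is an assumption, since the article does not say how that figure was computed.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def is_excellent(wer: float) -> bool:
    """True if a WER falls in the 'excellent accuracy' band (below 5%)."""
    return wer < 0.05


# One dropped word out of six reference words gives a WER of about 16.7%.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER: {wer:.1%}, excellent: {is_excellent(wer)}")
```

This simple version splits on whitespace only. Real benchmark evaluations usually normalize case and punctuation before comparing, so their numbers depend on those choices as well.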
The model also outperformed competitors such as Google Gemini 2.0 Flash and Whisper Large V3 across multiple languages on the FLEURS and Common Voice benchmarks.
ElevenLabs initially developed the speech-to-text component for its conversational AI agent platform, released last year. However, this is the first time the company has launched a stand-alone speech recognition model. In an interview with TechCrunch last month, CEO Mati Staniszewski emphasized the company's focus on improving such models. "We want to understand what's being said by you in a conversation better. We are working on ways to move away from only generating content to understanding and transcribing speech," Staniszewski explained. He added that while many consider speech-to-text a "solved problem," it remains inadequate for many languages. ElevenLabs aims to build better models by relying on its in-house teams for data annotation and rapid feedback.
The Scribe model includes features such as speaker diarization to identify who is speaking, word-level timestamps for precise subtitles, and automatic tagging of sound events such as audience laughter. The startup also lets customers transcribe video content directly, making it easier to add subtitles or captions in its studio.
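To show why word-level timestamps matter for subtitling, the Python sketch below groups timed words into cues in the common SRT subtitle format. It is a minimal illustration, not Scribe's actual output format or API: the `Word` structure, its field names, and the seven-words-per-cue limit are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A transcribed word with start and end times in seconds (hypothetical shape)."""
    text: str
    start: float
    end: float


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[Word], max_words: int = 7) -> str:
    """Group words into numbered cues of at most max_words words each."""
    cues = []
    for number, i in enumerate(range(0, len(words), max_words), start=1):
        chunk = words[i:i + max_words]
        text = " ".join(w.text for w in chunk)
        # Each cue spans from its first word's start to its last word's end.
        timing = f"{srt_time(chunk[0].start)} --> {srt_time(chunk[-1].end)}"
        cues.append(f"{number}\n{timing}\n{text}\n")
    return "\n".join(cues)


words = [Word("Hello", 0.0, 0.4), Word("world", 0.5, 1.0)]
print(words_to_srt(words))
```

Because every word carries its own timing, cue boundaries can be placed wherever a caption editor prefers, such as at pauses or speaker changes, rather than at fixed intervals.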
Scribe currently works only with pre-recorded audio. However, ElevenLabs plans to release a low-latency, real-time version of the model soon, which would extend it to uses such as meeting transcription and voice note-taking.