Speech-to-Text

Mistral AI launches “Voxtral Transcribe 2” for real-time speech recognition

co-written by newsrooms.ai, 5 February 2026, 12:29

French AI startup Mistral AI has released Voxtral Transcribe 2, a family of two next-generation speech-to-text models designed to deliver state-of-the-art transcription quality and “ultra-low latency”. The family comprises Voxtral Mini Transcribe V2 for batch transcription and Voxtral Realtime for live applications.

Voxtral Realtime is available as an open-weights model under the Apache 2.0 license and targets applications where latency is critical. It uses a novel streaming architecture that transcribes audio as it arrives. According to Mistral, the model delivers transcriptions with latency under 200 milliseconds, which the company says opens up a new class of voice-based applications.

The new speech model family natively supports 13 languages, including English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch.

Challenge to ChatGPT and competitors

With Voxtral Mini Transcribe V2, Mistral AI is launching a transcription model positioned against established solutions such as ChatGPT. The model aims to improve the quality of transcription and speaker recognition and to work reliably across different languages and use cases. With a word error rate of around four percent on the FLEURS benchmark, Voxtral achieves very high accuracy at a price of just $0.003 per minute. This makes it one of the most attractively priced offerings currently on the market.
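For readers unfamiliar with the metric: word error rate (WER) is the number of word substitutions, deletions, and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. One wrong word in a 25-word reference therefore corresponds to a WER of four percent. The sketch below is a minimal illustration of that definition, not the evaluation code used for FLEURS. The function name and the simple lowercase and whitespace normalization are assumptions made for this example, and real benchmarks typically apply more elaborate text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: normalization here is just lowercasing and
    whitespace splitting, which is an assumption of this sketch.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "please transcribe this short sentence"
    hypothesis = "please transcribe the short sentence"
    print(f"WER: {word_error_rate(reference, hypothesis):.2%}")
```

Because insertions count as errors, WER can exceed 100 percent when the hypothesis is much longer than the reference.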

In direct comparison, Voxtral Mini Transcribe V2 is said to outperform models like GPT-4o mini Transcribe, Gemini 2.5 Flash, Assembly Universal, and Deepgram Nova in accuracy. At the same time, according to Mistral, it processes audio data roughly three times faster than ElevenLabs Scribe v2 at comparable quality and about one-fifth the cost.

Technical design and enterprise suitability

Technically, Voxtral Transcribe 2 is clearly designed as a cost-effective enterprise solution. It supports context biasing, in which the model is given specific words or phrases so that they are transcribed correctly; this feature is currently optimized for English. Additionally, the model shows low susceptibility to background noise and is said to deliver stable results even in acoustically challenging environments such as factory floors or call centers.

For testing, the AI company provides an audio playground in Mistral Studio. There, up to ten audio files can be uploaded simultaneously, speaker recognition can be enabled or disabled, timestamp granularity can be selected, and context bias terms can be added. Common audio formats such as MP3, WAV, M4A, FLAC, and OGG are supported, with a maximum file size of one gigabyte per file.
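The playground limits described above can be checked locally before uploading. The following is a minimal sketch under stated assumptions: the names `AudioFile` and `validate_batch` are hypothetical and not part of any Mistral tooling, "one gigabyte" is read as 10^9 bytes, and the extension list contains only the formats named in this article, which may not be exhaustive.

```python
from dataclasses import dataclass
from pathlib import PurePath

# Limits taken from the article; how they are encoded here is an assumption.
MAX_FILES = 10
MAX_BYTES = 1_000_000_000  # "one gigabyte", read here as 10^9 bytes
ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}


@dataclass(frozen=True)
class AudioFile:
    """Hypothetical description of a file queued for upload."""
    name: str
    size_bytes: int


def validate_batch(files: list[AudioFile]) -> list[str]:
    """Return a list of human-readable problems; empty means the batch passes."""
    problems = []
    if len(files) > MAX_FILES:
        problems.append(f"too many files: {len(files)} > {MAX_FILES}")
    for f in files:
        ext = PurePath(f.name).suffix.lower()
        if ext not in ALLOWED_EXTENSIONS:
            problems.append(f"{f.name}: unsupported format {ext or '(none)'}")
        if f.size_bytes > MAX_BYTES:
            problems.append(f"{f.name}: larger than {MAX_BYTES} bytes")
    return problems


if __name__ == "__main__":
    batch = [AudioFile("meeting.mp3", 48_000_000), AudioFile("notes.txt", 2_000)]
    for problem in validate_batch(batch):
        print(problem)
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a batch at once.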

Data protection, availability, and pricing

As a European company, Mistral AI aims to win over customers by offering independence from US solutions. Both new Voxtral models support GDPR-compliant deployments, for example on-premise or in private cloud environments. Voxtral Mini Transcribe V2 is available immediately via API at a price of $0.003 per minute. Voxtral Realtime, for real-time applications, costs $0.006 per minute and is also offered as an open-weights model on Hugging Face.
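To put the per-minute prices in perspective, 1,000 hours of audio (60,000 minutes) would come to about $180 with the batch model and $360 with the realtime model. The sketch below reproduces that arithmetic. The dictionary keys are labels chosen for this example, not official API model identifiers, and the function assumes usage is billed pro rata by the second, which the article does not specify.

```python
from decimal import Decimal

# Prices per minute as quoted in the article. The keys are illustrative
# labels, not official model identifiers.
PRICE_PER_MINUTE_USD = {
    "batch": Decimal("0.003"),     # Voxtral Mini Transcribe V2
    "realtime": Decimal("0.006"),  # Voxtral Realtime
}


def estimate_cost(model: str, audio_seconds: int) -> Decimal:
    """Estimate the API cost in US dollars for a given audio duration.

    Assumes pro-rata billing by the second, which is an assumption of
    this sketch rather than a documented billing rule.
    """
    if audio_seconds < 0:
        raise ValueError("audio_seconds must not be negative")
    return PRICE_PER_MINUTE_USD[model] * Decimal(audio_seconds) / 60


if __name__ == "__main__":
    thousand_hours = 1_000 * 60 * 60  # seconds
    for model in PRICE_PER_MINUTE_USD:
        print(f"{model}: ${estimate_cost(model, thousand_hours):.2f}")
```

`Decimal` is used instead of floats so that small per-minute prices do not accumulate rounding errors over large volumes.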