Speech to Text AI Model & Provider Leaderboard

Compare word error rate, speed, and pricing across Speech to Text models and providers.

For further details, see our methodology page.

Highlights

WER Index (Non-streaming)

AA-WER v2 · % of words transcribed incorrectly · Lower is better

Speed Factor

Input audio seconds transcribed per second · Higher is better

Price

USD per 1000 minutes of audio · Lower is better

Artificial Analysis Word Error Rate Index (Non-streaming)

% of words transcribed incorrectly · Lower is better · AA-WER v2 incorporates 3 datasets: AA-AgentTalk (50%), VoxPopuli-Cleaned-AA (25%), Earnings22-Cleaned-AA (25%)

Note: For Earnings22, if a model cannot reliably handle full-length audio due to time limits, we chunk to ~9 minutes (relevant to: GPT-4o Mini Transcribe, OpenAI; GPT-4o Transcribe, OpenAI; Nova 2 Pro, Amazon; Voxtral Mini Transcribe, Mistral). For models with even shorter time limits, we chunk to ~30 seconds (relevant to: Qwen3 ASR Flash, Alibaba; Parakeet TDT 0.6B V3, NVIDIA; Canary Qwen 2.5B, NVIDIA).
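The chunking described in the note can be illustrated with a minimal sketch. It assumes fixed-length cuts by duration only; the function name `chunk_boundaries` is hypothetical, and the benchmark's actual chunking (for example, whether cuts are aligned to pauses) is not specified here.

```python
def chunk_boundaries(total_seconds: float, max_chunk_seconds: float) -> list[tuple[float, float]]:
    """Split [0, total_seconds) into consecutive (start, end) spans of at most max_chunk_seconds."""
    if total_seconds <= 0 or max_chunk_seconds <= 0:
        raise ValueError("durations must be positive")
    boundaries = []
    start = 0.0
    while start < total_seconds:
        # The final chunk is shortened so it never runs past the end of the audio.
        end = min(start + max_chunk_seconds, total_seconds)
        boundaries.append((start, end))
        start = end
    return boundaries


# Illustrative only: a 20-minute file cut into chunks of at most 9 minutes.
nine_minute_chunks = chunk_boundaries(20 * 60, 9 * 60)
```

With a 30-second limit the same call would be `chunk_boundaries(duration, 30)`.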

Measures transcription accuracy across 3 datasets to evaluate models in real-world speech with diverse accents, domain-specific language, and challenging channel & acoustic conditions.
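As a rough illustration of the underlying metric, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. The sketch below assumes simple lowercasing and whitespace tokenization; the text normalization actually applied for AA-WER is described on the methodology page and may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase + whitespace split stands in for real text normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Because insertions count as errors, WER computed this way can exceed 100% when a transcript contains many extra words.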

AA-WER is calculated as an audio-duration-weighted average of WER across ~8 hours from three datasets: AA-AgentTalk (50%), VoxPopuli-Cleaned-AA (25%), and Earnings22-Cleaned-AA (25%). See methodology for more detail.
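The weighting can be sketched as follows. This assumes the stated 50/25/25 split is applied directly to per-dataset WER values; the function name and the per-dataset numbers in the example are made up for illustration, and the exact aggregation is defined on the methodology page.

```python
# Dataset weights as stated for AA-WER v2.
AA_WER_V2_WEIGHTS = {
    "AA-AgentTalk": 0.50,
    "VoxPopuli-Cleaned-AA": 0.25,
    "Earnings22-Cleaned-AA": 0.25,
}


def aa_wer_index(wer_by_dataset: dict[str, float]) -> float:
    """Weighted average of per-dataset WER values (in percent)."""
    missing = set(AA_WER_V2_WEIGHTS) - set(wer_by_dataset)
    if missing:
        raise KeyError(f"missing WER for datasets: {sorted(missing)}")
    return sum(weight * wer_by_dataset[name] for name, weight in AA_WER_V2_WEIGHTS.items())


# Hypothetical per-dataset WERs (percent), not taken from the leaderboard.
index = aa_wer_index({
    "AA-AgentTalk": 2.0,
    "VoxPopuli-Cleaned-AA": 4.0,
    "Earnings22-Cleaned-AA": 6.0,
})
```

With these made-up inputs the index is 0.5 × 2.0 + 0.25 × 4.0 + 0.25 × 6.0 = 3.5%.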

AA-WER (Non-streaming) by Dataset

AA-WER (Non-streaming): AA-AgentTalk Dataset

% of words transcribed incorrectly on the AA-AgentTalk dataset · Lower is better

Cleaned Dataset Comparison

VoxPopuli: Cleaned vs Original Subset of Publicly Available Data

% WER (word error rate) · Lower is better

Note: The cleaned versions remove transcription errors from the reference text, providing a more accurate ground truth for model evaluation.

API Benchmarks

Artificial Analysis Word Error Rate Index (Non-streaming) vs. Price

% of words transcribed incorrectly · Lower is better · AA-WER v2 incorporates 3 datasets: AA-AgentTalk (50%), VoxPopuli-Cleaned-AA (25%), Earnings22-Cleaned-AA (25%) · USD per 1000 minutes of audio

Most attractive quadrant

Estimated cost in USD to transcribe 1,000 minutes of audio, normalized across providers with different billing models and including billed reasoning tokens where available. See the methodology page for further detail.
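For duration-based billing, the normalization amounts to a unit conversion. The sketch below is an assumption-laden illustration: the function name, the supported units, and the example prices are hypothetical, and token-billed models (including reasoning tokens) need a separate estimate that is not shown here.

```python
# Seconds in each billing unit handled by this sketch.
SECONDS_PER_UNIT = {"second": 1, "minute": 60, "hour": 3600}


def usd_per_1000_minutes(price_usd: float, per_unit: str) -> float:
    """Convert a price quoted per second, minute, or hour of audio to USD per 1,000 minutes."""
    if per_unit not in SECONDS_PER_UNIT:
        raise ValueError(f"unsupported billing unit: {per_unit}")
    price_per_second = price_usd / SECONDS_PER_UNIT[per_unit]
    return price_per_second * 60 * 1000


# Illustrative prices only; both convert to roughly 6 USD per 1,000 minutes.
per_minute_quote = usd_per_1000_minutes(0.006, "minute")
per_hour_quote = usd_per_1000_minutes(0.36, "hour")
```

Putting every quote on the same per-1,000-minute basis is what makes the price chart comparable across providers.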

Speed Factor

Input audio seconds transcribed per second · Higher is better

Audio file seconds transcribed per second of processing time. Higher factor indicates faster transcription speed. Reported Speed Factor values are medians across benchmark trials from the last 7 days; over-time chart points are daily medians. Artificial Analysis measurements are based on an audio duration of 10 minutes. Speed Factor may vary for other durations, particularly very short durations under 1 minute.
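A minimal sketch of the definition above: Speed Factor is audio duration divided by processing time, and the reported value is the median over trials. The function names and trial timings are hypothetical.

```python
from statistics import median


def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of input audio transcribed per second of processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing time must be positive")
    return audio_seconds / processing_seconds


def median_speed_factor(audio_seconds: float, trial_processing_seconds: list[float]) -> float:
    """Median Speed Factor across repeated trials on the same audio file."""
    return median(speed_factor(audio_seconds, t) for t in trial_processing_seconds)


# Hypothetical trials on a 10-minute (600 s) file that took 6.0 s, 5.0 s, and 7.5 s.
reported = median_speed_factor(600, [6.0, 5.0, 7.5])
```

The three trials give factors of 100, 120, and 80, so the median is 100. Using the median keeps a single slow or fast trial from dominating the reported value.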

Price of Transcription

USD per 1000 minutes of audio

Summary of Key Metrics & Further Information

Model	Provider	Whisper Version	AA-WER	Speed Factor	Price (USD per 1000 min)	Further Details
Qwen3.5 Omni Flash	Alibaba Cloud		13.5%	73.4	0.00	Details
Qwen3.5 Omni Plus	Alibaba Cloud		3.5%	97.4	0.00	Details
Nova 2 Pro	Amazon Bedrock		4.9%	22.5	3.10	Details
Amazon Transcribe	Amazon Bedrock		4.1%	19.0	24.00	Details
Universal-3 Pro	AssemblyAI		3.1%	83.6	3.50	Details
Universal, AssemblyAI	AssemblyAI		3.8%	114.4	2.50	Details
MAI-Transcribe-1.5	Microsoft Azure		2.4%	202.0	6.00	Details
MAI-Transcribe-1	Microsoft Azure		2.6%	56.2	6.00	Details
transcribe-03-2026	Cohere		4.6%	64.2	0.00	Details
Nova-3	Deepgram		5.2%	504.4	4.30	Details
Nova-2	Deepgram		5.3%	479.2	4.30	Details
Base	Deepgram		10.7%	353.3	12.50	Details
Scribe v2	ElevenLabs		2.2%	34.0	3.67	Details
Scribe v1	ElevenLabs		3.0%	39.3	6.67	Details
Solaria-1, Gladia	Gladia		4.1%	60.6	4.07	Details
Gemini 3.1 Pro Preview (High)	Google		2.8%	6.8	18.15	Details
Gemini 3.1 Pro Preview (Low)	Google		3.6%	6.5	7.72	Details
Gemini 3 Flash (High)	Google		2.9%	16.9	13.70	Details
Gemini 2.5 Flash Lite	Google		5.2%	69.3	6.56	Details
Gemini 2.5 Flash	Google		5.1%	69.4	6.66	Details
Gemini 2.5 Pro	Google		2.9%	12.3	11.39	Details
Gemini 3.1 Flash-Lite Preview (Minimal)	Google		3.4%	75.4	5.83	Details
Gradium Speech-to-Text	Gradium		8.4%	2.3	13.00	Details
Grok Speech to Text, xAI	xAI		4.0%	100.5	1.67	Details
LLM Speech, Azure	Microsoft Azure		3.7%	57.5	6.00	Details
Voxtral Mini Transcribe 2	Mistral		3.6%	73.4	3.00	Details
Voxtral Mini Transcribe	Mistral		3.5%	72.7	2.00	Details
Voxtral Small	Mistral		2.8%	66.1	4.00	Details
Voxtral Mini	DeepInfra		3.8%	74.7	1.00	Details
Modulate STT Batch English VFast	Modulate		13.0%	188.9	0.42	Details
Parakeet TDT 0.6B V3, Togetherai	Together.ai		4.5%	991.4	1.50	Details
Canary Qwen 2.5B, NVIDIA	Replicate		4.3%	5.8	0.74	Details
Parakeet TDT 0.6B V2, NVIDIA	NVIDIA		6.4%	101.2	0.00	Details
Parakeet RNNT 1.1B	Replicate		5.4%	6.3	1.91	Details
GPT-4o Transcribe	OpenAI		4.0%	33.6	6.00	Details
GPT-4o Mini Transcribe	OpenAI		4.5%	48.3	3.00	Details
Rev AI	Rev AI		5.9%	12.9	3.33	Details
Smallest AI Pulse	Smallest.ai		4.4%	135.3	5.00	Details
Speechmatics Standard	Speechmatics		5.1%	67.5	4.00	Details
Speechmatics Enhanced	Speechmatics		4.0%	52.3	6.70	Details
Whisper Large v3 Turbo	Groq	v3 Turbo	4.6%	119.7	0.67	Details
Whisper Large v3 Turbo	Fireworks	v3 Turbo	4.7%	161.9	1.00	Details
Wizper Large v3	fal.ai	large-v3	4.7%	201.5	0.50	Details
Incredibly Fast Whisper	Replicate	large-v3	5.7%	54.3	1.49	Details
Whisper Large v3	Replicate	large-v3	10.1%	2.8	4.23	Details
Whisper Large v3	fal.ai	large-v3	4.1%	98.8	1.15	Details
Whisper Large v3	Fireworks	large-v3	4.6%	324.6	1.00	Details
Whisper Large v3	Together.ai	large-v3	4.5%	428.0	1.50	Details
Whisper Large v2	OpenAI	large-v2	4.1%	26.8	6.00	Details

Frequently Asked Questions

Which is the most accurate speech to text model?

Fun-Realtime-ASR-preview leads with the lowest AA-WER (Artificial Analysis Word Error Rate) of 1.7% across 54 models evaluated.

What are the top speech to text models?

The top speech to text models by accuracy (AA-WER) are: 1. Fun-Realtime-ASR-preview (1.7%), 2. Scribe v2, ElevenLabs (2.2%), 3. MAI-Transcribe-1.5 (2.4%), 4. MAI-Transcribe-1 (2.6%), 5. Gemini 3 Pro (High), Google (2.7%). Lower AA-WER indicates better transcription accuracy.

Which is the fastest speech to text model?

Parakeet TDT 0.6B V3, Togetherai is the fastest with a speed factor of 991.4x real-time, followed by Nova-3 (504.4x) and Nova-2 (479.2x). Higher speed factors mean faster transcription.

Which is the cheapest speech to text model?

Modulate STT Batch English VFast is the most affordable at $0.417 per 1,000 minutes, followed by Wizper (L, v3), fal.ai ($0.50) and Whisper (L, v3, Turbo), Groq ($0.667).

Which is the best open weights speech to text model?

Voxtral Small, Mistral is the most accurate open weights model with an AA-WER of 2.8%. There are 13 open weights models out of 54 total evaluated.

What are the top open weights speech to text models?

The top open weights speech to text models by accuracy are: 1. Voxtral Small, Mistral (AA-WER 2.8%), 2. Voxtral Mini Transcribe, Mistral (AA-WER 3.5%), 3. Voxtral Mini Transcribe 2, Mistral (AA-WER 3.6%).

How do I choose the best speech to text model?

The best model depends on your priorities. Use the scatter plots to visualize trade-offs between accuracy (AA-WER), speed, and price. For applications requiring high accuracy, prioritize models with lower AA-WER scores. For real-time applications, focus on speed factor. For cost-sensitive workloads, compare the price charts.
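One way to apply these trade-offs is to filter on accuracy and speed and then sort by price. The sketch below uses five rows copied from the summary table; the thresholds, class name, and function name are illustrative assumptions, not recommendations.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    provider: str
    aa_wer_pct: float        # lower is better
    speed_factor: float      # higher is better
    usd_per_1000_min: float  # lower is better


# A small subset of rows from the summary table above.
ENTRIES = [
    Entry("Scribe v2", "ElevenLabs", 2.2, 34.0, 3.67),
    Entry("MAI-Transcribe-1.5", "Microsoft Azure", 2.4, 202.0, 6.00),
    Entry("Voxtral Small", "Mistral", 2.8, 66.1, 4.00),
    Entry("Whisper Large v3 Turbo", "Groq", 4.6, 119.7, 0.67),
    Entry("Nova-3", "Deepgram", 5.2, 504.4, 4.30),
]


def shortlist(entries: list[Entry], max_wer_pct: float, min_speed_factor: float) -> list[Entry]:
    """Keep entries meeting both thresholds, cheapest first."""
    eligible = [
        e for e in entries
        if e.aa_wer_pct <= max_wer_pct and e.speed_factor >= min_speed_factor
    ]
    return sorted(eligible, key=lambda e: e.usd_per_1000_min)


# Hypothetical requirements: at most 3% AA-WER and at least 50x real-time.
candidates = shortlist(ENTRIES, max_wer_pct=3.0, min_speed_factor=50.0)
```

With these thresholds the shortlist is Voxtral Small followed by MAI-Transcribe-1.5; loosening the accuracy threshold brings in the cheaper and faster entries.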

Speech to Text models & providers compared: MAI-Transcribe-1.5, MAI-Transcribe-1, Grok Speech to Text, xAI, Qwen3.5 Omni Flash, Qwen3.5 Omni Plus, Cohere Transcribe 03-2026, Gemini 3.1 Pro Preview (High), Gemini 3.1 Pro Preview (Low), Smallest AI Pulse, Voxtral Mini Transcribe 2, Universal-3 Pro, Scribe v2, Gemini 3 Flash (High), Nova 2 Pro, Gradium Speech-to-Text, LLM Speech, Azure, Parakeet TDT 0.6B V3, Togetherai, Gemini 2.5 Flash Lite, Canary Qwen 2.5B, Replicate, Voxtral Mini Transcribe, Voxtral Small, Voxtral Mini, Deepinfra, Gemini 2.5 Flash, Gemini 2.5 Pro, Parakeet TDT 0.6B V2, Solaria-1, GPT-4o Transcribe, GPT-4o Mini Transcribe, Scribe v1, Nova-3, Universal, Whisper (L, v3, Turbo), Groq, Whisper (L, v3, Turbo), Fireworks, Parakeet RNNT 1.1B, Replicate, Amazon Transcribe, Wizper (L, v3), fal.ai, Incredibly Fast, Replicate, Whisper (L, v3), Replicate, Whisper (L, v3), fal.ai, Whisper (L, v3), Fireworks, Whisper Large v3, together.ai, Nova-2, Standard, Enhanced, Large v2, Base, Rev AI, Gemini 3.1 Flash-Lite Preview (Minimal), Modulate STT Batch English VFast.
