Speech-to-Text (STT) Models

MLX Audio provides a range of speech-to-text models optimized for Apple Silicon, from lightweight English-only models to large multilingual systems with translation capabilities.

Model Comparison

| Model | Provider | Parameters | Languages | Streaming | Timestamps | Repo |
|---|---|---|---|---|---|---|
| Whisper | OpenAI | Various | 99+ | -- | Segment + Word | mlx-community/whisper-large-v3-turbo-asr-fp16 |
| Distil-Whisper | HuggingFace | Various | EN | -- | Segment | distil-whisper/distil-large-v3 |
| Parakeet | NVIDIA | 0.6B | EN (v2), 25 EU (v3) | Yes | Sentence + Word | mlx-community/parakeet-tdt-0.6b-v3 |
| Voxtral Realtime | Mistral | 4B | Multiple | Yes | -- | 4bit, fp16 |
| Qwen3-ASR | Alibaba | 0.6B / 1.7B | ZH, EN, JA, KO + more | Yes | Segment | mlx-community/Qwen3-ASR-1.7B-8bit |
| Qwen3-ForcedAligner | Alibaba | 0.6B | ZH, EN, JA, KO + more | -- | Word | mlx-community/Qwen3-ForcedAligner-0.6B-8bit |
| VibeVoice-ASR | Microsoft | 9B | Multiple | Yes | Segment | mlx-community/VibeVoice-ASR-bf16 |
| Voxtral | Mistral | 3B | Multiple | -- | -- | mlx-community/Voxtral-Mini-3B-2507-bf16 |
| Cohere Transcribe | Cohere | 2B | 14 | -- | Segment | CohereLabs/cohere-transcribe-03-2026 |
| Qwen2-Audio | Alibaba | 7B | Multiple | -- | -- | mlx-community/Qwen2-Audio-7B-Instruct-4bit |
| Canary | NVIDIA | ~1B | 25 EU + RU, UK | -- | -- | README |
| SenseVoice | Alibaba DAMO | ~234M | 50+ | -- | -- | mlx-community/SenseVoiceSmall |
| FireRedASR2 | Xiaohongshu | ~1.18B | ZH, EN | -- | -- | mlx-community/FireRedASR2-AED-mlx |
| Granite Speech | IBM | ~1B | EN, FR, DE, ES, PT, JA | Yes | -- | README |
| Moonshine | Useful Sensors | 27M / 61M | EN | -- | -- | README |
| MMS | Meta | 1B | 1000+ | -- | -- | README |

Cohere Quantized Local Checkpoints

Cohere Transcribe can be loaded from local MLX-converted checkpoints, including 8-bit and 4-bit variants. To produce these variants, pass --quantize together with --q-bits 8 or --q-bits 4:

  • 8-bit: python -m mlx_audio.convert --model-domain stt --quantize --q-bits 8
  • 4-bit: python -m mlx_audio.convert --model-domain stt --quantize --q-bits 4

You can also pass --q-group-size to control the quantization group size when needed.
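
Once converted, the local directory loads through the same unified API as a Hub checkpoint. A minimal sketch, assuming the convert step wrote its output to ./cohere-transcribe-8bit (the actual output location depends on your convert invocation):

from mlx_audio.stt import load

# The directory name here is an assumption -- substitute whatever
# output path your convert run actually produced.
model = load("./cohere-transcribe-8bit")
result = model.generate("audio.wav")
print(result.text)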

Unified API

All STT models share the same loading and generation interface:

from mlx_audio.stt import load

# Accepts a Hugging Face repo id or a local checkpoint directory
model = load("mlx-community/whisper-large-v3-turbo-asr-fp16")
result = model.generate("audio.wav")
print(result.text)

The same transcription is available from the command line:

python -m mlx_audio.stt.generate \
  --model mlx-community/whisper-large-v3-turbo-asr-fp16 \
  --audio audio.wav \
  --verbose
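
For models that report timestamps (see the Timestamps column above), timing data rides along on the result object. A hedged sketch, assuming a segments attribute whose entries carry start, end, and text fields; the exact shape varies by model, so check each model's README:

from mlx_audio.stt import load

model = load("mlx-community/whisper-large-v3-turbo-asr-fp16")
result = model.generate("audio.wav")

# Iterate segment-level timestamps; the attribute names used here
# (segments, start, end, text) are assumptions, not a documented API.
for seg in result.segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")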

Choosing a Model

  • Best multilingual coverage: Whisper (99+ languages) or MMS (1000+ languages)
  • Best accuracy for English: Parakeet v2 or Whisper large-v3-turbo
  • Best for European languages: Parakeet v3 (25 languages) or Canary
  • Lowest latency / streaming: Voxtral Realtime (4bit variant)
  • Smallest footprint: Moonshine tiny (27M parameters)
  • Speaker diarization built-in: VibeVoice-ASR
  • Word-level alignment: Qwen3-ForcedAligner
  • Emotion / event detection: SenseVoice
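
Whichever row you pick, the workflow is the same: take the repo id from the comparison table and pass it to load. For example, the Parakeet v3 pick for European languages (repo id from the table above; the generate call mirrors the unified API example):

from mlx_audio.stt import load

# Parakeet v3: streaming-capable, 25 European languages (per the table)
model = load("mlx-community/parakeet-tdt-0.6b-v3")
result = model.generate("audio.wav")
print(result.text)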