Speech-to-Text (STT) Models¶
MLX Audio provides a range of speech-to-text models optimized for Apple Silicon, from lightweight English-only models to large multilingual systems with translation capabilities.
Model Comparison¶
| Model | Provider | Parameters | Languages | Streaming | Timestamps | Repo |
|---|---|---|---|---|---|---|
| Whisper | OpenAI | Various | 99+ | -- | Segment + Word | mlx-community/whisper-large-v3-turbo-asr-fp16 |
| Distil-Whisper | HuggingFace | Various | EN | -- | Segment | distil-whisper/distil-large-v3 |
| Parakeet | NVIDIA | 0.6B | EN (v2), 25 EU (v3) | Yes | Sentence + Word | mlx-community/parakeet-tdt-0.6b-v3 |
| Voxtral Realtime | Mistral | 4B | Multiple | Yes | -- | 4bit, fp16 |
| Qwen3-ASR | Alibaba | 0.6B / 1.7B | ZH, EN, JA, KO + more | Yes | Segment | mlx-community/Qwen3-ASR-1.7B-8bit |
| Qwen3-ForcedAligner | Alibaba | 0.6B | ZH, EN, JA, KO + more | -- | Word-level | mlx-community/Qwen3-ForcedAligner-0.6B-8bit |
| VibeVoice-ASR | Microsoft | 9B | Multiple | Yes | Segment | mlx-community/VibeVoice-ASR-bf16 |
| Voxtral | Mistral | 3B | Multiple | -- | -- | mlx-community/Voxtral-Mini-3B-2507-bf16 |
| Cohere Transcribe | Cohere | 2B | 14 languages | -- | Segment | CohereLabs/cohere-transcribe-03-2026 |
| Qwen2-Audio | Alibaba | 7B | Multiple | -- | -- | mlx-community/Qwen2-Audio-7B-Instruct-4bit |
| Canary | NVIDIA | ~1B | 25 EU + RU, UK | -- | -- | README |
| SenseVoice | Alibaba DAMO | ~234M | 50+ | -- | -- | mlx-community/SenseVoiceSmall |
| FireRedASR2 | Xiaohongshu | ~1.18B | ZH, EN | -- | -- | mlx-community/FireRedASR2-AED-mlx |
| Granite Speech | IBM | ~1B | EN, FR, DE, ES, PT, JA | Yes | -- | README |
| Moonshine | Useful Sensors | 27M / 61M | EN | -- | -- | README |
| MMS | Meta | 1B | 1000+ | -- | -- | README |
Cohere quantized local checkpoints
Cohere Transcribe can be loaded from local MLX-converted checkpoints, including 8-bit and 4-bit variants. To generate these variants explicitly, pass `--q-bits 8` or `--q-bits 4` together with `--quantize`:

- 8-bit: `python -m mlx_audio.convert --model-domain stt --quantize --q-bits 8`
- 4-bit: `python -m mlx_audio.convert --model-domain stt --quantize --q-bits 4`

You can also pass `--q-group-size` to control the quantization group size when needed.
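For example, a 4-bit conversion with an explicit group size might look like the following (the value 64 is an illustrative assumption, not a documented default; check the converter's `--help` output):

```shell
# 4-bit conversion with an explicit quantization group size.
# The group size 64 is illustrative; consult --help for actual defaults.
python -m mlx_audio.convert --model-domain stt --quantize --q-bits 4 --q-group-size 64
```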
Unified API¶
All STT models share the same loading interface.
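The shared pattern can be sketched as follows. The names `load`, `STTModel`, and `transcribe` here are illustrative assumptions, not the actual mlx-audio symbols; consult the package documentation for the real entry points:

```python
from dataclasses import dataclass


@dataclass
class TranscriptionResult:
    text: str


class STTModel:
    """Common surface every backend in the table exposes (sketch only)."""

    def __init__(self, repo: str):
        self.repo = repo

    def transcribe(self, audio_path: str) -> TranscriptionResult:
        # A real backend would run MLX inference here; this stub only
        # demonstrates the call shape shared across models.
        return TranscriptionResult(text=f"<transcript of {audio_path}>")


def load(repo: str) -> STTModel:
    """One entry point for any checkpoint id from the comparison table."""
    return STTModel(repo)


model = load("mlx-community/parakeet-tdt-0.6b-v3")
result = model.transcribe("speech.wav")
print(result.text)
```

The point of the unified interface is that swapping models is a one-line change: only the repo id passed to the loader differs.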
Choosing a model
- Best multilingual coverage: Whisper (99+ languages) or MMS (1000+ languages)
- Best accuracy for English: Parakeet v2 or Whisper large-v3-turbo
- Best for European languages: Parakeet v3 (25 languages) or Canary
- Lowest latency / streaming: Voxtral Realtime (4bit variant)
- Smallest footprint: Moonshine tiny (27M parameters)
- Speaker diarization built-in: VibeVoice-ASR
- Word-level alignment: Qwen3-ForcedAligner
- Emotion / event detection: SenseVoice
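For scripted selection, the guide above can be encoded as a small lookup over repo ids taken from the comparison table. This helper is purely illustrative and not part of mlx-audio:

```python
# Map a primary requirement to a checkpoint from the comparison table.
# Illustrative only -- not an mlx-audio API.
RECOMMENDED = {
    "multilingual": "mlx-community/whisper-large-v3-turbo-asr-fp16",  # 99+ languages
    "english": "mlx-community/parakeet-tdt-0.6b-v3",
    "word_alignment": "mlx-community/Qwen3-ForcedAligner-0.6B-8bit",
    "diarization": "mlx-community/VibeVoice-ASR-bf16",
    "emotion_events": "mlx-community/SenseVoiceSmall",
}


def suggest(requirement: str) -> str:
    """Return a repo id for the given requirement, or raise ValueError."""
    try:
        return RECOMMENDED[requirement]
    except KeyError:
        known = ", ".join(sorted(RECOMMENDED))
        raise ValueError(
            f"unknown requirement {requirement!r}; choose from: {known}"
        ) from None


print(suggest("english"))  # mlx-community/parakeet-tdt-0.6b-v3
```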