Whisper

Whisper is OpenAI's robust speech-to-text model supporting 99+ languages. The MLX Audio implementation also natively supports distilled variants like Distil-Whisper.

Available Models

| Model | Parameters | Description | Repo |
| --- | --- | --- | --- |
| whisper-large-v3-turbo | ~809M | Fastest large model, multilingual | mlx-community/whisper-large-v3-turbo-asr-fp16 |
| whisper-large-v3 | ~1.5B | Best accuracy, multilingual | mlx-community/whisper-large-v3 |
| distil-large-v3 | ~756M | Distilled, English-focused | distil-whisper/distil-large-v3 |

Python Usage

Basic Transcription

from mlx_audio.stt import load

# Standard Whisper
model = load("mlx-community/whisper-large-v3-turbo-asr-fp16")

# Distil-Whisper
# model = load("distil-whisper/distil-large-v3")

result = model.generate("audio.wav")
print(result.text)

Segment-Level Timestamps

Segment-level timestamps are enabled by default:

result = model.generate("audio.wav")
for segment in result.segments:
    print(f"[{segment['start']:.2f} -> {segment['end']:.2f}] {segment['text']}")

Word-Level Timestamps

Word-level timestamps use cross-attention alignment heads via DTW:

result = model.generate("audio.wav", word_timestamps=True)
for segment in result.segments:
    for word in segment["words"]:
        print(f"[{word['start']:.2f} -> {word['end']:.2f}] {word['word']} (p={word['probability']:.3f})")

Disable Timestamps

result = model.generate("audio.wav", return_timestamps=False)

CLI Usage

# Basic transcription with verbose output
mlx_audio.stt.generate \
  --model mlx-community/whisper-large-v3-turbo-asr-fp16 \
  --audio audio.wav \
  --verbose

# Enable word-level timestamps
mlx_audio.stt.generate \
  --model mlx-community/whisper-large-v3-turbo-asr-fp16 \
  --audio audio.wav \
  --verbose \
  --gen-kwargs '{"word_timestamps": true}'

# Write word-level timestamps to a JSON file
mlx_audio.stt.generate \
  --model mlx-community/whisper-large-v3-turbo-asr-fp16 \
  --audio audio.wav \
  --format json \
  --output-path output.json \
  --gen-kwargs '{"word_timestamps": true}'

Language Support

Whisper supports 99+ languages. The model automatically detects the spoken language, or you can specify it explicitly for better accuracy.

Distil-Whisper

Distil-Whisper variants are faster and smaller than the full Whisper models while maintaining strong accuracy for English transcription. Load them the same way; just swap the model name.