Models

MLX Audio supports a wide range of audio models across four categories, all optimized for Apple Silicon.

Many hosted MLX checkpoints referenced in these docs live under mlx-community on Hugging Face, the shared org for ready-to-use MLX model weights across projects like mlx-lm, mlx-vlm, and mlx-audio. If you are adding a new model, prefer publishing it there when possible so users can find MLX models in one consistent place.
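
As a rough illustration of the convention above, the sketch below splits a Hugging Face repo ID of the form `<org>/<name>` and reports whether it lives under mlx-community. The names `RepoRef` and `parse_repo_id` are hypothetical helpers invented for this example and are not part of mlx-audio. The only external fact assumed is that a model page is served at `https://huggingface.co/<org>/<name>`.

```python
from dataclasses import dataclass

HF_BASE_URL = "https://huggingface.co"
SHARED_ORG = "mlx-community"  # shared org named in the paragraph above


@dataclass(frozen=True)
class RepoRef:
    """Hypothetical helper type: one Hugging Face repo reference."""

    org: str
    name: str

    @property
    def repo_id(self) -> str:
        return f"{self.org}/{self.name}"

    @property
    def url(self) -> str:
        # Model pages are served at <base>/<org>/<name>.
        return f"{HF_BASE_URL}/{self.repo_id}"

    @property
    def in_shared_org(self) -> bool:
        return self.org == SHARED_ORG


def parse_repo_id(repo_id: str) -> RepoRef:
    """Split '<org>/<name>' into its parts, rejecting anything else."""
    org, sep, name = repo_id.strip().partition("/")
    if not sep or not org or not name or "/" in name:
        raise ValueError(f"expected '<org>/<name>', got {repo_id!r}")
    return RepoRef(org=org, name=name)


# Repo IDs taken from the tables on this page.
for repo_id in ("mlx-community/Kokoro-82M-bf16", "kugelaudio/kugelaudio-0-open"):
    ref = parse_repo_id(repo_id)
    print(ref.url, "shared org" if ref.in_shared_org else "other org")
```

A check like this could be used when reviewing a new model entry to spot weights that are hosted outside the shared org.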

Text-to-Speech (TTS)

Generate natural-sounding speech from text. The models below cover multilingual synthesis, voice cloning, and style control.

| Model | Description | Languages | Repo |
| --- | --- | --- | --- |
| Kokoro | Fast, high-quality multilingual TTS | EN, JA, ZH, FR, ES, IT, PT, HI | mlx-community/Kokoro-82M-bf16 |
| Qwen3-TTS | Alibaba's multilingual TTS with voice design | ZH, EN, JA, KO, + more | mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16 |
| Voxtral TTS | Mistral's 4B multilingual TTS (20 voices, 9 languages) | EN, FR, ES, DE, IT, PT, NL, AR, HI | mlx-community/Voxtral-4B-TTS-2603-mlx-bf16 |
| CSM | Conversational Speech Model with voice cloning | EN | mlx-community/csm-1b |
| Dia | Dialogue-focused TTS | EN | mlx-community/Dia-1.6B-fp16 |
| Chatterbox | Expressive multilingual TTS | EN, ES, FR, DE, IT, PT, + more | mlx-community/chatterbox-fp16 |
| KugelAudio | 7B multilingual TTS for 24 European languages | 24 European languages | kugelaudio/kugelaudio-0-open |
| Soprano | High-quality TTS | EN | mlx-community/Soprano-1.1-80M-bf16 |
| OuteTTS | Efficient TTS model | EN | mlx-community/OuteTTS-1.0-0.6B-fp16 |
| Spark | SparkTTS model | EN, ZH | mlx-community/Spark-TTS-0.5B-bf16 |
| Ming Omni TTS (BailingMM) | Multimodal generation with voice cloning and style control | EN, ZH | mlx-community/Ming-omni-tts-16.8B-A3B-bf16 |
| Ming Omni TTS (Dense) | Lightweight dense Ming Omni variant | EN, ZH | mlx-community/Ming-omni-tts-0.5B-bf16 |
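
When choosing a TTS model by language, the Languages column can be treated as a simple lookup. The sketch below copies a few rows from the table into a dictionary and filters them by language code. `TTS_LANGUAGES` and `repos_for_language` are hypothetical names made up for this example, not an mlx-audio API, and rows whose language list ends in "+ more" are left out because the table does not enumerate them fully.

```python
# A hand-copied subset of the TTS table: repo ID -> listed language codes.
# Rows with open-ended lists ("+ more") are deliberately omitted.
TTS_LANGUAGES = {
    "mlx-community/Kokoro-82M-bf16": {"EN", "JA", "ZH", "FR", "ES", "IT", "PT", "HI"},
    "mlx-community/Voxtral-4B-TTS-2603-mlx-bf16": {
        "EN", "FR", "ES", "DE", "IT", "PT", "NL", "AR", "HI",
    },
    "mlx-community/csm-1b": {"EN"},
    "mlx-community/Spark-TTS-0.5B-bf16": {"EN", "ZH"},
}


def repos_for_language(code, catalog=TTS_LANGUAGES):
    """Return the repo IDs in `catalog` that list the given language code."""
    wanted = code.strip().upper()  # accept "zh", " ZH ", etc.
    return sorted(repo for repo, langs in catalog.items() if wanted in langs)


print(repos_for_language("zh"))
print(repos_for_language("de"))
```

The lookup only reflects what the table lists; a model may support additional languages that are documented on its own model card.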

Speech-to-Text (STT)

Transcribe and understand speech. Depending on the model, features include streaming, word-level timestamps, and speaker diarization.

| Model | Description | Languages | Repo |
| --- | --- | --- | --- |
| Whisper | OpenAI's robust STT model | 99+ languages | mlx-community/whisper-large-v3-turbo-asr-fp16 |
| Distil-Whisper | Distilled fast Whisper variants | EN | distil-whisper/distil-large-v3 |
| Qwen3-ASR | Alibaba's multilingual ASR | ZH, EN, JA, KO, + more | mlx-community/Qwen3-ASR-1.7B-8bit |
| Qwen3-ForcedAligner | Word-level audio alignment | ZH, EN, JA, KO, + more | mlx-community/Qwen3-ForcedAligner-0.6B-8bit |
| Parakeet | NVIDIA's accurate STT | EN (v2), 25 EU languages (v3) | mlx-community/parakeet-tdt-0.6b-v3 |
| Voxtral | Mistral's speech model | Multiple | mlx-community/Voxtral-Mini-3B-2507-bf16 |
| Voxtral Realtime | Mistral's 4B streaming STT | Multiple | 4bit / fp16 |
| VibeVoice-ASR | Microsoft's 9B ASR with diarization | Multiple | mlx-community/VibeVoice-ASR-bf16 |
| Qwen2-Audio | Audio-language model for transcription, translation, and audio understanding | Multiple | mlx-community/Qwen2-Audio-7B-Instruct-4bit |
| Canary | NVIDIA's multilingual ASR with translation | 25 EU + RU, UK | -- |
| Moonshine | Useful Sensors' lightweight ASR | EN | -- |
| MMS | Meta's massively multilingual ASR | 1000+ languages | -- |
| Granite Speech | IBM's ASR + speech translation | EN, FR, DE, ES, PT, JA | -- |

Voice Activity Detection / Speaker Diarization (VAD)

Detect speech segments and identify speakers in audio.

| Model | Description | Repo |
| --- | --- | --- |
| Sortformer v1 | NVIDIA's end-to-end speaker diarization (up to 4 speakers) | mlx-community/diar_sortformer_4spk-v1-fp32 |
| Sortformer v2.1 | Streaming speaker diarization with AOSC compression | mlx-community/diar_streaming_sortformer_4spk-v2.1-fp32 |
| Smart Turn | Endpoint detection for conversational turn-taking | mlx-community/smart-turn-v3 |

Speech-to-Speech (STS)

Transform, separate, and enhance audio.

| Model | Description | Use Case | Repo |
| --- | --- | --- | --- |
| SAM-Audio | Text-guided source separation | Extract specific sounds | mlx-community/sam-audio-large |
| Liquid2.5-Audio | Speech-to-Speech, TTS, and STT | Speech interactions | mlx-community/LFM2.5-Audio-1.5B-8bit |
| MossFormer2 SE | Speech enhancement | Noise removal | starkdmi/MossFormer2_SE_48K_MLX |
| DeepFilterNet (1/2/3) | Speech enhancement | Noise suppression | mlx-community/DeepFilterNet-mlx |