Text-to-Speech Models
MLX-Audio supports a wide range of TTS models optimized for Apple Silicon. The models trade off speed, quality, language coverage, and features differently.
Model Comparison
| Model | Size | Languages | Voice Cloning | Streaming | Key Features |
|---|---|---|---|---|---|
| Kokoro | 82M | EN, JA, ZH, FR, ES, IT, PT, HI | -- | -- | Fast, 54 voice presets, speed control |
| Qwen3-TTS | 0.6B / 1.7B | ZH, EN, JA, KO, + more | Yes | Yes | Voice cloning, emotion control, voice design, batch generation |
| MOSS-TTS | 8B / 1.7B | 20 languages | Yes | -- | Delay-pattern and local-transformer RVQ generation, full MOSS Audio Tokenizer |
| OmniVoice | 0.6B backbone + HiggsAudio tokenizer | 646+ languages | Yes | -- | Zero-shot multilingual cloning, nonverbal tags, CMU + pinyin controls |
| Voxtral TTS | 4B | EN, FR, ES, DE, IT, PT, NL, AR, HI | -- | Yes | 20 voice presets, 9 languages, chunked streaming output |
| Svara TTS | 3B | 19 Indian languages (HI, BN, TA, TE, KN, ML, MR, GU, PA, OR, AS, BH, MAG, MAI, HNE, BRX, DOI, NE, SA, EN-IN) | -- | Yes | Orpheus-family, SNAC 24 kHz, 38 voices, 4-bit/8-bit MLX quants |
| CSM | 1B | EN | Yes | Yes | Conversational speech, voice cloning, multi-turn context |
| Dia | 1.6B | EN | -- | -- | Dialogue with [S1]/[S2] speaker tags |
| Chatterbox | -- | EN + 15 languages | Yes | -- | Expressive, emotion exaggeration control |
| KugelAudio | 7B | 24 European languages | -- | -- | VibeVoice-based multilingual TTS with diffusion decoding |
| Spark | 0.5B | EN, ZH | -- | -- | SparkTTS model |
| OuteTTS | 0.6B | EN | -- | -- | Efficient TTS |
| Soprano | 80M | EN | -- | -- | High-quality TTS |
| Ming Omni TTS | 16.8B (A3B) / 0.5B | EN, ZH | Yes | -- | Voice cloning, style/emotion control, music & sound FX generation |
| TADA | 1B / 3B | EN (1B), EN + 9 languages (3B) | Yes | -- | HumeAI, speed control, flow matching |
| Echo TTS | -- | EN | Yes | -- | Diffusion-based, fast voice cloning |
| Irodori TTS | 500M | JA | Yes | -- | Japanese-only, DiT + DACVAE |
| Fish Speech | -- | EN | Yes | -- | Inline control tags, multi-speaker, long-form batching |
| VoxCPM2 | 2B | 30 languages | Yes | -- | 48 kHz, voice design, voice cloning, continuation |
Quick Start
All TTS models share a common interface:
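As a minimal sketch of that shared interface, loading a model and generating speech might look like the following. Several parts are assumptions rather than confirmed API: the `load_model` helper in `mlx_audio.tts.utils`, a `generate` method that yields result objects with an `audio` array, the `mlx-community/Kokoro-82M-bf16` repository ID, and the `af_heart` voice name. The wrapper function `synthesize` is hypothetical and exists only for illustration.

```python
def synthesize(text, model_id="mlx-community/Kokoro-82M-bf16", voice="af_heart"):
    """Yield one audio array per generated segment of `text`.

    `synthesize` is a hypothetical wrapper; the default model ID and
    voice name are assumptions, not values confirmed by this page.
    """
    # Imported lazily so the function can be defined without MLX installed.
    # Assumed import path for the model loader.
    from mlx_audio.tts.utils import load_model

    model = load_model(model_id)

    # Assumed behaviour: generate() yields results that expose the
    # waveform as an array on an `audio` attribute.
    for result in model.generate(text=text, voice=voice):
        yield result.audio
```

Under these assumptions, iterating over `synthesize("Hello from MLX-Audio!")` would produce the waveform segment by segment, and switching models would only require a different `model_id` and a voice that model supports.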
Choosing a model
- Fastest / smallest: Kokoro (82M) -- great for quick generation with many voice presets.
- Voice cloning: CSM, Qwen3-TTS, or OmniVoice -- clone a voice from reference speech.
- Multilingual: Voxtral TTS (9 languages, 20 voices) or Chatterbox (16 languages).
- Dialogue: Dia -- built-in support for multi-speaker conversations, with turns marked by [S1]/[S2] speaker tags.
- Emotion / style control: Qwen3-TTS CustomVoice or VoiceDesign variants.
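The selection criteria above can also be expressed as a small filter over the comparison table. The sketch below is illustrative only: it copies a handful of rows from the table into a hypothetical `CATALOG` list, and the `TTSModel` class and `candidates` function are not part of MLX-Audio.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TTSModel:
    """One row of the comparison table (hypothetical helper type)."""
    name: str
    languages: frozenset
    voice_cloning: bool
    streaming: bool


# A subset of the comparison table, transcribed by hand.
CATALOG = [
    TTSModel("Kokoro", frozenset({"EN", "JA", "ZH", "FR", "ES", "IT", "PT", "HI"}), False, False),
    TTSModel("Voxtral TTS", frozenset({"EN", "FR", "ES", "DE", "IT", "PT", "NL", "AR", "HI"}), False, True),
    TTSModel("CSM", frozenset({"EN"}), True, True),
    TTSModel("Dia", frozenset({"EN"}), False, False),
    TTSModel("Irodori TTS", frozenset({"JA"}), True, False),
]


def candidates(language, voice_cloning=False, streaming=False):
    """Return names of models that cover `language` and every requested feature."""
    lang = language.upper()
    return [
        m.name
        for m in CATALOG
        if lang in m.languages
        # A feature only filters when the caller asks for it.
        and (m.voice_cloning or not voice_cloning)
        and (m.streaming or not streaming)
    ]


print(candidates("ja"))
print(candidates("en", voice_cloning=True, streaming=True))
```

Extending `CATALOG` with the remaining rows works the same way. Models whose language list is only summarized in the table (for example "20 languages") would need their codes filled in from the individual model pages.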