Text-to-Speech Models

MLX-Audio supports a wide range of TTS models optimized for Apple Silicon. Each model offers different tradeoffs between speed, quality, languages, and features.

Model Comparison

| Model | Size | Languages | Voice Cloning | Streaming | Key Features |
| --- | --- | --- | --- | --- | --- |
| Kokoro | 82M | EN, JA, ZH, FR, ES, IT, PT, HI | -- | -- | Fast, 54 voice presets, speed control |
| KittenTTS | 14.6M / 35.5M / 73.8M | EN | -- | -- | KittenTTS 0.8 nano/micro/mini, compact edge-friendly TTS, speed control |
| Qwen3-TTS | 0.6B / 1.7B | ZH, EN, JA, KO, + more | Yes | Yes | Voice cloning, emotion control, voice design, batch generation |
| Higgs Audio v3 | 4B | 100 languages | Yes | -- | Conversational TTS, inline emotion/style/prosody controls, bundled Higgs codec |
| MOSS-TTS | 8B / 1.7B | 31 languages | Yes | -- | Delay-pattern and local-transformer RVQ generation, full MOSS Audio Tokenizer |
| OmniVoice | 0.6B backbone + HiggsAudio tokenizer | 646+ languages | Yes | -- | Zero-shot multilingual cloning, nonverbal tags, CMU + pinyin controls |
| Voxtral TTS | 4B | EN, FR, ES, DE, IT, PT, NL, AR, HI | -- | Yes | 20 voice presets, 9 languages, chunked streaming output |
| Svara TTS | 3B | 19 Indian languages (HI, BN, TA, TE, KN, ML, MR, GU, PA, OR, AS, BH, MAG, MAI, HNE, BRX, DOI, NE, SA, EN-IN) | -- | Yes | Orpheus-family, SNAC 24 kHz, 38 voices, 4-bit/8-bit MLX quants |
| CSM / MisoTTS | 1B / 8B | EN | Yes | Yes | Sesame-style conversational speech, voice cloning, multi-turn context |
| Dia | 1.6B | EN | -- | -- | Dialogue with `[S1]`/`[S2]` speaker tags |
| Chatterbox | -- | EN + 15 languages | Yes | -- | Expressive, emotion exaggeration control |
| KugelAudio | 7B | 24 European languages | -- | -- | VibeVoice-based multilingual TTS with diffusion decoding |
| Spark | 0.5B | EN, ZH | -- | -- | SparkTTS model |
| OuteTTS | 0.6B | EN | -- | -- | Efficient TTS |
| Soprano | 80M | EN | -- | -- | High-quality TTS |
| Ming Omni TTS | 16.8B (A3B) / 0.5B | EN, ZH | Yes | -- | Voice cloning, style/emotion control, music & sound FX generation |
| TADA | 1B / 3B | EN (1B), EN + 9 languages (3B) | Yes | -- | HumeAI, speed control, flow matching |
| Echo TTS | -- | EN | Yes | -- | Diffusion-based, fast voice cloning |
| Irodori TTS | 500M | JA | Yes | -- | Japanese-only, DiT + DACVAE |
| Fish Speech | -- | EN | Yes | -- | Inline control tags, multi-speaker, long-form batching |
| VoxCPM2 | 2B | 30 languages | Yes | -- | 48 kHz, voice design, voice cloning, continuation |

Quick Start

All TTS models share a common interface:

CLI:

mlx_audio.tts.generate \
    --model <model-id> \
    --text "Hello, world!" \
    --voice <voice-name>

Python:

from mlx_audio.tts.utils import load_model

# Load a model by its ID
model = load_model("<model-id>")

# generate() is iterated; each result carries a chunk of audio
for result in model.generate(text="Hello, world!"):
    audio = result.audio  # mx.array waveform
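The Python loop above leaves each chunk in `result.audio`. As a minimal sketch of what could come next, the helpers below gather the chunks and write a mono 16-bit WAV file using only the standard library. `FakeResult`, `collect_samples`, and `write_wav` are hypothetical names for illustration and are not part of MLX-Audio. The sketch assumes each chunk is a 1-D sequence of floats in [-1, 1] and that the sample rate is 24 kHz; check the model you use for its actual rate.

```python
import os
import struct
import tempfile
import wave
from dataclasses import dataclass
from typing import Iterable, List, Sequence


@dataclass
class FakeResult:
    """Hypothetical stand-in for a generation result; only `.audio` is assumed."""
    audio: Sequence[float]


def collect_samples(results: Iterable) -> List[float]:
    """Concatenate the `.audio` chunk of every result into one list of floats."""
    samples: List[float] = []
    for result in results:
        audio = result.audio
        # Array types such as mx.array expose tolist(); plain sequences are copied.
        chunk = audio.tolist() if hasattr(audio, "tolist") else list(audio)
        samples.extend(chunk)
    return samples


def write_wav(path: str, samples: Sequence[float], sample_rate: int = 24000) -> None:
    """Write mono 16-bit PCM. The 24 kHz default is an assumption, not a library fact."""
    frames = b"".join(
        # Clamp to [-1, 1] before scaling so out-of-range values cannot overflow.
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
        for s in samples
    )
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(frames)


if __name__ == "__main__":
    # Stub chunks stand in for `model.generate(text=...)` so the sketch runs anywhere.
    stub_results = [FakeResult([0.0, 0.25, 0.5]), FakeResult([-0.5, -0.25])]
    out_path = os.path.join(tempfile.mkdtemp(), "hello.wav")
    write_wav(out_path, collect_samples(stub_results))
    print(out_path)
```

With a real model, `collect_samples(model.generate(text="Hello, world!"))` would replace the stub list, provided the chunks are one-dimensional.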

Choosing a model

Fast / small: Kokoro (82M) -- quick generation with 54 voice presets. KittenTTS (14.6M to 73.8M) is smaller still, but English-only.
Voice cloning: CSM, Qwen3-TTS, Higgs Audio v3, or OmniVoice -- clone a voice from reference speech.
Multilingual: Voxtral TTS (9 languages, 20 voices) or Chatterbox (16 languages).
Dialogue: Dia -- built-in support for multi-speaker conversations.
Emotion / style control: Qwen3-TTS CustomVoice or VoiceDesign variants.
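When a use case needs a combination of features, the Voice Cloning and Streaming columns of the comparison table can be filtered directly. The sketch below copies those two columns into a small lookup and treats `--` as "not supported". `ModelInfo` and `find_models` are hypothetical helpers for illustration and are not part of MLX-Audio.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ModelInfo:
    name: str
    voice_cloning: bool
    streaming: bool


# Flags copied from the comparison table ("Yes" -> True, "--" -> False).
MODELS = [
    ModelInfo("Kokoro", False, False),
    ModelInfo("KittenTTS", False, False),
    ModelInfo("Qwen3-TTS", True, True),
    ModelInfo("Higgs Audio v3", True, False),
    ModelInfo("MOSS-TTS", True, False),
    ModelInfo("OmniVoice", True, False),
    ModelInfo("Voxtral TTS", False, True),
    ModelInfo("Svara TTS", False, True),
    ModelInfo("CSM / MisoTTS", True, True),
    ModelInfo("Dia", False, False),
    ModelInfo("Chatterbox", True, False),
    ModelInfo("KugelAudio", False, False),
    ModelInfo("Spark", False, False),
    ModelInfo("OuteTTS", False, False),
    ModelInfo("Soprano", False, False),
    ModelInfo("Ming Omni TTS", True, False),
    ModelInfo("TADA", True, False),
    ModelInfo("Echo TTS", True, False),
    ModelInfo("Irodori TTS", True, False),
    ModelInfo("Fish Speech", True, False),
    ModelInfo("VoxCPM2", True, False),
]


def find_models(
    voice_cloning: Optional[bool] = None,
    streaming: Optional[bool] = None,
) -> List[str]:
    """Return model names matching every flag that is not None, in table order."""
    matches = []
    for model in MODELS:
        if voice_cloning is not None and model.voice_cloning != voice_cloning:
            continue
        if streaming is not None and model.streaming != streaming:
            continue
        matches.append(model.name)
    return matches


if __name__ == "__main__":
    print(find_models(voice_cloning=True, streaming=True))
    # ['Qwen3-TTS', 'CSM / MisoTTS']
```

Leaving a flag as `None` means "don't care", so `find_models(streaming=True)` lists every streaming-capable model whether or not it supports cloning.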