Text-to-Speech Models

MLX-Audio supports a wide range of TTS models optimized for Apple Silicon. Each model offers different tradeoffs between speed, quality, languages, and features.

Model Comparison

| Model | Size | Languages | Voice Cloning | Streaming | Key Features |
|---|---|---|---|---|---|
| Kokoro | 82M | EN, JA, ZH, FR, ES, IT, PT, HI | -- | -- | Fast, 54 voice presets, speed control |
| Qwen3-TTS | 0.6B / 1.7B | ZH, EN, JA, KO, + more | Yes | Yes | Voice cloning, emotion control, voice design, batch generation |
| MOSS-TTS | 8B / 1.7B | 20 languages | Yes | -- | Delay-pattern and local-transformer RVQ generation, full MOSS Audio Tokenizer |
| OmniVoice | 0.6B backbone + HiggsAudio tokenizer | 646+ languages | Yes | -- | Zero-shot multilingual cloning, nonverbal tags, CMU + pinyin controls |
| Voxtral TTS | 4B | EN, FR, ES, DE, IT, PT, NL, AR, HI | -- | Yes | 20 voice presets, 9 languages, chunked streaming output |
| Svara TTS | 3B | 19 Indian langs (HI, BN, TA, TE, KN, ML, MR, GU, PA, OR, AS, BH, MAG, MAI, HNE, BRX, DOI, NE, SA, EN-IN) | -- | Yes | Orpheus-family, SNAC 24 kHz, 38 voices, 4-bit/8-bit MLX quants |
| CSM | 1B | EN | Yes | Yes | Conversational speech, voice cloning, multi-turn context |
| Dia | 1.6B | EN | -- | -- | Dialogue with [S1]/[S2] speaker tags |
| Chatterbox | -- | EN + 15 languages | Yes | -- | Expressive, emotion exaggeration control |
| KugelAudio | 7B | 24 European languages | -- | -- | VibeVoice-based multilingual TTS with diffusion decoding |
| Spark | 0.5B | EN, ZH | -- | -- | SparkTTS model |
| OuteTTS | 0.6B | EN | -- | -- | Efficient TTS |
| Soprano | 80M | EN | -- | -- | High-quality TTS |
| Ming Omni TTS | 16.8B (A3B) / 0.5B | EN, ZH | Yes | -- | Voice cloning, style/emotion control, music & sound FX generation |
| TADA | 1B / 3B | EN (1B), EN + 9 langs (3B) | Yes | -- | HumeAI, speed control, flow matching |
| Echo TTS | -- | EN | Yes | -- | Diffusion-based, fast voice cloning |
| Irodori TTS | 500M | JA | Yes | -- | Japanese-only, DiT + DACVAE |
| Fish Speech | -- | EN | Yes | -- | Inline control tags, multi-speaker, long-form batching |
| VoxCPM2 | 2B | 30 languages | Yes | -- | 48 kHz, voice design, voice cloning, continuation |

Quick Start

All TTS models share a common interface:

CLI:

mlx_audio.tts.generate \
    --model <model-id> \
    --text "Hello, world!" \
    --voice <voice-name>

Python:

from mlx_audio.tts.utils import load_model

model = load_model("<model-id>")

for result in model.generate(text="Hello, world!"):
    audio = result.audio  # mx.array waveform
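
Because `generate` yields results one at a time, a common next step is to collect the chunks and save them to a file. The sketch below is one way to do that using only the Python standard library. `write_wav` is a hypothetical helper, not part of MLX-Audio. It assumes each chunk is a flat sequence of float samples in [-1, 1], and the 24000 Hz default is a placeholder that should be replaced with the model's actual output rate.

```python
import struct
import wave
from typing import Iterable, Sequence


def write_wav(chunks: Iterable[Sequence[float]], path: str,
              sample_rate: int = 24000) -> int:
    """Write float chunks as 16-bit mono PCM; return the number of samples written.

    Hypothetical helper: assumes mono audio with samples in [-1, 1].
    """
    total = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 16-bit samples
        wav.setframerate(sample_rate)
        for chunk in chunks:
            # Clip out-of-range samples, then scale to the int16 range.
            clipped = [max(-1.0, min(1.0, float(s))) for s in chunk]
            ints = [int(s * 32767) for s in clipped]
            wav.writeframes(struct.pack(f"<{len(ints)}h", *ints))
            total += len(ints)
    return total


# With a loaded model (not run here; assumes result.audio is a 1-D mx.array):
# chunks = (result.audio.tolist() for result in model.generate(text="Hello, world!"))
# write_wav(chunks, "hello.wav", sample_rate=24000)
```

Writing chunk by chunk keeps memory use flat for long or streamed generations, since no full waveform is ever assembled in memory.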

Choosing a model

  • Fastest / smallest: Kokoro (82M) -- great for quick generation with many voice presets.
  • Voice cloning: CSM, Qwen3-TTS, or OmniVoice -- clone a voice from reference speech.
  • Multilingual: Voxtral TTS (9 languages, 20 voices) or Chatterbox (16 languages).
  • Dialogue: Dia -- built-in support for multi-speaker conversations.
  • Emotion / style control: Qwen3-TTS CustomVoice or VoiceDesign variants.
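
The guidance above amounts to filtering the comparison table by requirements. As a rough illustration, the sketch below encodes a few rows of that table and filters them by language, voice cloning, and streaming. `TTSModel`, `MODELS`, and `candidates` are hypothetical names for this example, not part of MLX-Audio. Only rows with explicit language codes are included, and Qwen3-TTS carries only the languages the table lists by name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TTSModel:
    name: str
    languages: frozenset
    cloning: bool
    streaming: bool


# A subset of the comparison table (hypothetical encoding for illustration).
MODELS = [
    TTSModel("Kokoro", frozenset({"EN", "JA", "ZH", "FR", "ES", "IT", "PT", "HI"}), False, False),
    # The table lists "+ more" for Qwen3-TTS; only the named languages appear here.
    TTSModel("Qwen3-TTS", frozenset({"ZH", "EN", "JA", "KO"}), True, True),
    TTSModel("Voxtral TTS", frozenset({"EN", "FR", "ES", "DE", "IT", "PT", "NL", "AR", "HI"}), False, True),
    TTSModel("CSM", frozenset({"EN"}), True, True),
    TTSModel("Dia", frozenset({"EN"}), False, False),
    TTSModel("Irodori TTS", frozenset({"JA"}), True, False),
]


def candidates(language: str, cloning: bool = False, streaming: bool = False) -> list:
    """Return names of models that support `language` and every requested feature."""
    return [
        m.name
        for m in MODELS
        if language in m.languages
        and (m.cloning or not cloning)        # cloning required only if requested
        and (m.streaming or not streaming)    # streaming required only if requested
    ]


print(candidates("EN", cloning=True))    # ['Qwen3-TTS', 'CSM']
print(candidates("HI", streaming=True))  # ['Voxtral TTS']
```

A feature flag left at `False` means "not required", so `candidates("EN")` returns every English-capable model in the subset.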