Skip to content

Qwen3-TTS

Alibaba's state-of-the-art multilingual TTS with three model variants covering voice cloning, emotion control, and voice design from text descriptions. Supports streaming and batched generation.

Model Variants

Model Method Description HuggingFace
mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16 generate() Fast, predefined voices Model Card
mlx-community/Qwen3-TTS-12Hz-1.7B-Base-bf16 generate() Higher quality Model Card
mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-bf16 generate_custom_voice() Voices + emotion Model Card
mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-bf16 generate_custom_voice() Better emotion control Model Card
mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16 generate_voice_design() Create any voice from description Model Card

Usage

Basic Generation

mlx_audio.tts.generate \
    --model mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16 \
    --text "Hello, welcome to MLX-Audio!" \
    --voice Chelsie
from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16")
results = list(model.generate(
    text="Hello, welcome to MLX-Audio!",
    voice="Chelsie",
    language="English",
))

audio = results[0].audio  # mx.array

Voice Cloning

Clone any voice by providing a reference audio sample and its transcript:

from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16")
results = list(model.generate(
    text="Hello from Sesame.",
    ref_audio="sample_audio.wav",
    ref_text="This is what my voice sounds like.",
))

audio = results[0].audio  # mx.array

CustomVoice (Emotion Control)

Use predefined voices with emotion and style instructions:

from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-bf16")
results = list(model.generate_custom_voice(
    text="I'm so excited to meet you!",
    speaker="Vivian",
    language="English",
    instruct="Very happy and excited.",
))

audio = results[0].audio  # mx.array

VoiceDesign (Create Any Voice)

Create a voice from a free-form text description:

from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16")
results = list(model.generate_voice_design(
    text="Big brother, you're back!",
    language="English",
    instruct="A cheerful young female voice with high pitch and energetic tone.",
))

audio = results[0].audio  # mx.array

Streaming

All generation methods support stream=True for low-latency playback:

from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-6bit")

audio_chunks = []
for result in model.generate(
    text="Hello, how are you today?",
    voice="serena",
    stream=True,
    streaming_interval=0.32,  # ~4 tokens at 12.5Hz
):
    audio_chunks.append(result.audio)
    # Play or process each chunk for low-latency output

Batch Generation

Generate multiple texts with different voices in a single batched forward pass:

from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-6bit")

texts = [
    "Hello, how are you today?",
    "The quick brown fox jumps over the lazy dog.",
    "Artificial intelligence is transforming the world.",
    "Good morning, welcome to the show!",
]
voices = ["serena", "vivian", "ryan", "aiden"]

for result in model.batch_generate(
    texts=texts,
    voices=voices,
    stream=True,
    streaming_interval=0.32,
):
    audio_chunk = result.audio       # mx.array [samples]
    seq_idx = result.sequence_idx    # which sequence (0-3)
    is_done = result.is_final_chunk  # True on last chunk

Batch throughput (6-bit, short prompt)

Batch TPS Throughput Avg TTFB Memory
1 20.8 1.67x 84.8ms 3.88GB
2 34.7 2.78x 78.0ms 3.92GB
4 53.2 4.26x 99.9ms 3.98GB
8 68.1 5.45x 140.5ms 4.10GB

Available Speakers

Chinese: Vivian, Serena, Uncle_Fu, Dylan (Beijing Dialect), Eric (Sichuan Dialect)

English: Ryan, Aiden