Voice Cloning¶
Several MLX Audio models can clone a speaker's voice from a short reference audio sample. This guide covers the supported models and how to use them.
Overview¶
| Model | Method | Reference Audio Required | Notes |
|---|---|---|---|
| CSM | `--ref_audio` CLI / `ref_audio` kwarg | Yes (WAV) | Conversational Speech Model from Sesame |
| Qwen3-TTS Base | `ref_audio` + `ref_text` kwargs | Yes (WAV) + transcript | Alibaba multilingual TTS |
| OmniVoice | `ref_audio` + `ref_text` kwargs | Yes (WAV) + transcript recommended | 646+ language zero-shot cloning, best with prompt preprocessing |
| Spark | `ref_audio` kwarg | Yes | SparkTTS voice cloning |
| Chatterbox | `ref_audio` kwarg | Yes | Expressive multilingual TTS |
| OuteTTS | `ref_audio` kwarg | Yes | Efficient TTS with cloning |
| Ming Omni TTS | `ref_audio` kwarg | Yes | Multimodal with voice cloning |
CSM (Conversational Speech Model)¶
CSM is the simplest path to voice cloning. Provide a WAV file of the target voice and CSM will match it:
CLI¶
```bash
mlx_audio.tts.generate \
  --model mlx-community/csm-1b \
  --text "Hello from Sesame." \
  --ref_audio ./reference_voice.wav \
  --play
```
Python¶
```python
from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/csm-1b")

for result in model.generate(
    text="Hello from Sesame.",
    ref_audio="./reference_voice.wav",
):
    audio = result.audio
```
**Automatic transcription.** If you do not provide `--ref_text`, MLX Audio will automatically transcribe the reference audio using a Whisper model. You can specify which STT model to use with `--stt_model`.
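For example, to pick the same Whisper model used later in this guide for the transcription pass:

```bash
mlx_audio.tts.generate \
  --model mlx-community/csm-1b \
  --text "Hello from Sesame." \
  --ref_audio ./reference_voice.wav \
  --stt_model mlx-community/whisper-large-v3-turbo-asr-fp16
```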
Qwen3-TTS¶
Qwen3-TTS Base models support voice cloning by providing both a reference audio file and its transcript:
```python
from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16")

results = list(model.generate(
    text="Hello, welcome to MLX-Audio!",
    ref_audio="sample_audio.wav",
    ref_text="This is what my voice sounds like.",
))
audio = results[0].audio  # mx.array
```
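To write the result to disk, you can reuse the `audio_write` helper from `mlx_audio.audio_io` (the same import used in the OmniVoice example below); `results[0].sample_rate` carries the model's output rate:

```python
import numpy as np
from mlx_audio.audio_io import write as audio_write

# Convert the mx.array to NumPy and write a WAV at the model's sample rate.
audio_write("cloned.wav", np.array(results[0].audio), results[0].sample_rate)
```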
CustomVoice (Emotion Control)¶
The CustomVoice variant lets you combine a predefined voice with emotion and style instructions:
```python
model = load_model("mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-bf16")

results = list(model.generate_custom_voice(
    text="I'm so excited to meet you!",
    speaker="Vivian",
    language="English",
    instruct="Very happy and excited.",
))
```
VoiceDesign (Create Any Voice)¶
The VoiceDesign variant creates a voice from a text description -- no reference audio needed:
```python
model = load_model("mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16")

results = list(model.generate_voice_design(
    text="Big brother, you're back!",
    language="English",
    instruct="A cheerful young female voice with high pitch and energetic tone.",
))
```
Available Qwen3-TTS Models¶
| Model | Method | Description |
|---|---|---|
| Qwen3-TTS-12Hz-0.6B-Base-bf16 | `generate()` | Fast, predefined voices + cloning |
| Qwen3-TTS-12Hz-1.7B-Base-bf16 | `generate()` | Higher quality |
| Qwen3-TTS-12Hz-0.6B-CustomVoice-bf16 | `generate_custom_voice()` | Voices + emotion |
| Qwen3-TTS-12Hz-1.7B-CustomVoice-bf16 | `generate_custom_voice()` | Better emotion control |
| Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16 | `generate_voice_design()` | Create any voice from a description |
Available Speakers (Base / CustomVoice)¶
- **Chinese:** `Vivian`, `Serena`, `Uncle_Fu`, `Dylan` (Beijing dialect), `Eric` (Sichuan dialect)
- **English:** `Ryan`, `Aiden`
Ming Omni TTS¶
Ming Omni TTS supports voice cloning and style control:
```bash
mlx_audio.tts.generate \
  --model mlx-community/Ming-omni-tts-16.8B-A3B-bf16 \
  --text "This is a Ming Omni voice cloning test." \
  --ref_audio ./reference_voice.wav \
  --lang_code en \
  --verbose
```
See the Ming Omni TTS model page for detailed cookbook examples.
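The overview table lists Ming Omni TTS as accepting a `ref_audio` kwarg in Python. A minimal sketch along those lines; the `lang_code` keyword is an assumption mirroring the CLI flag and may differ across versions:

```python
from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/Ming-omni-tts-16.8B-A3B-bf16")

for result in model.generate(
    text="This is a Ming Omni voice cloning test.",
    ref_audio="./reference_voice.wav",
    lang_code="en",  # assumption: mirrors the CLI --lang_code flag
):
    audio = result.audio
```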
Best Practices¶
Preparing Reference Audio¶
**Quality matters.** The quality of your cloned voice depends heavily on the reference audio.

- **Duration** -- 5 to 15 seconds of clean speech works best. Very short clips lack enough speaker information; very long clips may confuse the model. (A quick duration check is sketched after this list.)
- **Format** -- Use WAV at 16 kHz or higher. The library will resample automatically, but starting with a good sample rate avoids artifacts.
- **Noise** -- Record in a quiet environment. Background noise will be cloned along with the voice. Consider using MossFormer2 or DeepFilterNet to enhance noisy recordings first.
- **Content** -- Natural, conversational speech produces the best results. Avoid whispering or shouting unless that is the target style.
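Before cloning, it can help to sanity-check the clip against these guidelines. This minimal sketch uses only the Python standard library (the filename is a placeholder):

```python
import wave

# Inspect duration and sample rate of the reference clip.
with wave.open("reference.wav", "rb") as f:
    rate = f.getframerate()
    duration = f.getnframes() / rate

print(f"{duration:.1f} s at {rate} Hz")

# Guidelines from above: 5-15 s of clean speech at 16 kHz or higher.
if not 5.0 <= duration <= 15.0:
    print("Consider trimming or re-recording the reference clip.")
if rate < 16000:
    print("Consider recording at a higher sample rate.")
```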
Providing Reference Text¶
Some models (Qwen3-TTS) require a transcript of the reference audio (`ref_text`). If you omit it, MLX Audio will transcribe the clip automatically using Whisper:

```bash
mlx_audio.tts.generate \
  --model mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16 \
  --text "Cloned speech output." \
  --ref_audio reference.wav \
  --stt_model mlx-community/whisper-large-v3-turbo-asr-fp16
```
Providing the transcript yourself avoids loading the STT model and speeds up generation.
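To skip the STT pass entirely, pass the transcript yourself with the `--ref_text` flag mentioned in the CSM note above:

```bash
mlx_audio.tts.generate \
  --model mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16 \
  --text "Cloned speech output." \
  --ref_audio reference.wav \
  --ref_text "This is what my voice sounds like."
```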
OmniVoice¶
OmniVoice supports multilingual zero-shot voice cloning with a HiggsAudioV2 acoustic tokenizer and iterative masked generation.
**`ref_text` must match the preprocessed audio.** OmniVoice preprocessing removes silence and trims the reference clip. If you transcribe the original file, the ASR transcript may be longer than the preprocessed audio, causing the extra text to leak into generation. Always transcribe the preprocessed audio, not the raw recording. See `examples/omnivoice_clone_demo.py` for the correct workflow.
```python
import os
import tempfile

import mlx.core as mx
import numpy as np

from mlx_audio.audio_io import write as audio_write
from mlx_audio.stt.utils import load_model as load_stt
from mlx_audio.tts.models.omnivoice.utils import create_voice_clone_prompt
from mlx_audio.tts.utils import load_model as load_tts

tts = load_tts("mlx-community/OmniVoice-bf16")

# Preprocess -> encode -> decode -> transcribe (matches the original pipeline).
ref_tokens = create_voice_clone_prompt("reference.wav", tokenizer=tts.audio_tokenizer)
mx.eval(ref_tokens)

# Decode the reference tokens back to audio so the ASR model transcribes
# exactly what the TTS model will condition on.
preprocessed = np.array(tts.audio_tokenizer.decode(ref_tokens).astype(mx.float32))

# Write the preprocessed clip to a temp file and transcribe it.
tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
audio_write(tmp.name, preprocessed, 24000)
tmp.close()
ref_text = load_stt("mlx-community/Qwen3-ASR-0.6B-8bit").generate(tmp.name).text
os.unlink(tmp.name)

results = list(tts.generate(
    text="Hello from OmniVoice.",
    language="english",
    ref_tokens=ref_tokens,
    ref_text=ref_text,
))
audio_write("output.wav", np.array(results[0].audio), results[0].sample_rate)
```
OmniVoice-specific notes¶
- **Reference text is required for stable cloning.** Without it, output quality degrades significantly: garbled speech, wrong language, or missing words.
- **Transcribe after preprocessing, not before.** The original k2-fsa/OmniVoice runs Whisper on audio after silence removal. This demo replicates that approach.
- **Prompt preprocessing matters.** MLX Audio mirrors the original Python pipeline with RMS normalization, silence removal, trimming at silence gaps, and torchaudio-compatible resampling before reference encoding.
- **Best reference length:** roughly 5 to 15 seconds of actual speech after silence trimming.
- **Supported inline controls:** nonverbal tags such as `[laughter]` and `[sigh]`, plus pronunciation overrides for English CMU dictionary forms and Chinese pinyin forms.
Example: English CMU pronunciation control¶
```python
results = list(tts.generate(
    text="He plays the [B EY1 S] guitar while catching a [B AE1 S] fish.",
    language="english",
))
```
Example: Nonverbal tags¶
```python
results = list(tts.generate(
    text="I just heard the funniest joke [laughter] that was incredible.",
    language="english",
))
```
Combining Cloning with Streaming¶
Voice cloning and streaming work together. Add `--stream` to any of the cloning commands above:
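For example, the CSM command from earlier with streaming enabled:

```bash
mlx_audio.tts.generate \
  --model mlx-community/csm-1b \
  --text "Hello from Sesame." \
  --ref_audio ./reference_voice.wav \
  --stream \
  --play
```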