Skip to content

Higgs Audio v3 TTS

Higgs Audio v3 TTS is a Qwen3-backed conversational TTS model with fused multi-codebook audio token generation, inline control tokens, multilingual speech, and zero-shot voice cloning.

python -m mlx_audio.tts.generate \
  --model bosonai/higgs-audio-v3-tts-4b \
  --text "Hello from Higgs Audio v3 on MLX."

Voice cloning

Pass one or more reference clips with matching transcripts:

python -m mlx_audio.tts.generate \
  --model bosonai/higgs-audio-v3-tts-4b \
  --text "Have a nice day and enjoy the sunshine." \
  --ref_audio reference.wav \
  --ref_text "Reference transcript."

Multiple references use repeated CLI flags:

python -m mlx_audio.tts.generate \
  --model bosonai/higgs-audio-v3-tts-4b \
  --text "Let's keep the same voice across this line." \
  --ref_audio speaker_1.wav \
  --ref_text "First reference transcript." \
  --ref_audio speaker_2.wav \
  --ref_text "Second reference transcript."

Python

from mlx_audio.tts.utils import load
from mlx_audio.audio_io import write as audio_write

model = load("bosonai/higgs-audio-v3-tts-4b")

for result in model.generate(
    text="Hello from Higgs Audio v3 on MLX.",
    ref_audio="reference.wav",
    ref_text="Reference transcript.",
    temperature=1.0,
    max_new_tokens=2048,
):
    audio_write("output.wav", result.audio, result.sample_rate)

Controls

Inline control tokens from the upstream model can be placed directly in the input text, for example:

<|emotion:amusement|><|prosody:expressive_high|>That was unexpected. <|sfx:laughter|>Hehe.

For sound-effect tags, follow the upstream guidance and include matching written onomatopoeia after the tag.

Notes