Dia

Dia is a 1.6B-parameter, dialogue-focused text-to-speech (TTS) model. It natively supports two-speaker conversations through [S1] and [S2] speaker tags, which makes it well suited to generating dialogue audio.

Model Variants

Model                         Format    HuggingFace
mlx-community/Dia-1.6B-fp16   float16   Model Card

Usage

Basic Dialogue Generation

CLI:

mlx_audio.tts.generate \
    --model mlx-community/Dia-1.6B-fp16 \
    --text "[S1] Hey, have you tried MLX-Audio? [S2] Yes, it runs great on my Mac!"

Python:

from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/Dia-1.6B-fp16")

for result in model.generate(
    text="[S1] Hey, have you tried MLX-Audio? [S2] Yes, it runs great on my Mac!",
):
    audio = result.audio  # mx.array waveform
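
The loop above only keeps the waveform in memory. A minimal sketch for writing it to disk with the Python standard library might look like the following. Note that write_wav is a hypothetical helper, not part of mlx-audio, and it assumes mono float samples in the range -1.0 to 1.0.

```python
import struct
import wave


def write_wav(path, samples, sample_rate):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file.

    Hypothetical helper for illustration; not part of mlx-audio.
    """
    frames = bytearray()
    for s in samples:
        # Clip to the valid range, then scale to a signed 16-bit integer.
        clipped = max(-1.0, min(1.0, float(s)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
```

Assuming result.audio converts to a plain list with .tolist(), a call could be write_wav("dialogue.wav", result.audio.tolist(), sample_rate), where sample_rate has to match the model's output rate, which this page does not state.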

Multi-Turn Dialogue

Dia automatically splits text on [S1]/[S2] tags and generates each turn separately:

from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/Dia-1.6B-fp16")

dialogue = """[S1] Welcome to the show! Today we're talking about AI on Apple Silicon.
[S2] Thanks for having me. It's an exciting time for on-device inference.
[S1] Absolutely. What's been the biggest breakthrough?
[S2] I'd say the combination of unified memory and optimized frameworks like MLX."""

for result in model.generate(text=dialogue):
    audio = result.audio
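
To preview how a script breaks into turns before spending time on generation, a small pure-Python splitter can help. This is an illustrative sketch only: split_turns is a hypothetical helper and is not claimed to match Dia's internal splitting logic.

```python
import re

# Matches a speaker tag such as [S1] or [S2].
SPEAKER_TAG = re.compile(r"\[(S[12])\]")


def split_turns(script):
    """Return a list of (speaker, text) pairs from a tagged script.

    Hypothetical helper for illustration; not Dia's own implementation.
    """
    turns = []
    # re.split with one capture group yields:
    # [text_before_first_tag, tag, text, tag, text, ...]
    parts = SPEAKER_TAG.split(script)
    for speaker, text in zip(parts[1::2], parts[2::2]):
        text = text.strip()
        if text:  # skip tags that carry no text
            turns.append((speaker, text))
    return turns
```

Running split_turns on the dialogue string above should give four pairs, alternating between S1 and S2, which is a quick way to catch a missing or mistyped tag.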

With Reference Audio

from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/Dia-1.6B-fp16")

for result in model.generate(
    text="[S1] Hello, this is a voice cloning test.",
    ref_audio="reference.wav",
    ref_text="This is a sample of my voice.",
):
    audio = result.audio

Generation Parameters

Parameter       Default   Description
temperature     1.3       Sampling temperature
top_p           0.95      Top-p (nucleus) sampling threshold
split_pattern   "\n"      Pattern used to split text into segments
max_tokens      None      Maximum number of tokens to generate
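
For readers unfamiliar with temperature and top_p, the following is a generic sketch of temperature-scaled nucleus sampling over a list of logits. It is a textbook illustration under the defaults listed above, not Dia's actual sampling code, and sample_top_p is a hypothetical name.

```python
import math
import random


def sample_top_p(logits, temperature=1.3, top_p=0.95, rng=random):
    """Sample an index from logits using temperature and nucleus filtering.

    Generic illustration; not taken from Dia's implementation.
    """
    # Temperature scaling: values above 1 flatten the distribution.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Sample from the kept tokens; random.choices renormalizes the weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

Lowering temperature or top_p concentrates sampling on the most likely tokens, which generally trades variety for stability. Raising them does the opposite.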

Dialogue Format

Use [S1] and [S2] tags at the start of each speaker's line. Dia will automatically separate turns and generate distinct voices for each speaker.
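
When the dialogue comes from structured data rather than a hand-written string, the tagged script can be assembled programmatically. The sketch below assumes one turn per line, which lines up with the default split_pattern of "\n". Note that format_dialogue is a hypothetical helper, not part of mlx-audio.

```python
def format_dialogue(turns):
    """Build a tagged script from (speaker, line) pairs, one turn per line.

    Hypothetical helper for illustration; not part of mlx-audio.
    """
    allowed = {"S1", "S2"}
    lines = []
    for speaker, line in turns:
        if speaker not in allowed:
            raise ValueError(f"unknown speaker tag: {speaker!r}")
        # Put the tag at the start of the speaker's line.
        lines.append(f"[{speaker}] {line.strip()}")
    return "\n".join(lines)
```

The returned string can be passed as the text argument to model.generate in the same way as the hand-written scripts above.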