
Svara TTS

Svara TTS is a multilingual autoregressive text-to-speech model for 19 Indian languages, in the Orpheus / SNAC family. It is based on kenpath/svara-tts-v1, a Llama-3.2-3B fine-tune of Canopy Labs' canopylabs/3b-hi-ft-research_release Orpheus base, paired with the SNAC 24 kHz neural codec.

Model Variants

Model                            Format     Size     HuggingFace
mlx-community/svara-tts-v1-4bit  MLX 4-bit  ~1.9 GB  Model Card
mlx-community/svara-tts-v1-8bit  MLX 8-bit  ~3.5 GB  Model Card

Usage

mlx_audio.tts.generate \
    --model mlx-community/svara-tts-v1-4bit \
    --text "नमस्ते, आप कैसे हैं?" \
    --voice "Hindi (Female)" \
    --temperature 0.75 \
    --top_p 0.9

From Python:

import numpy as np
import soundfile as sf
import mlx.core as mx
from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/svara-tts-v1-4bit")

# model.generate yields results one at a time; collect the audio from each.
chunks = []
for result in model.generate(
    text="नमस्ते, आप कैसे हैं? मैं ठीक हूँ।",
    voice="Hindi (Female)",
    temperature=0.75,
    top_p=0.9,
    top_k=40,
    repetition_penalty=1.1,
    max_tokens=1200,
):
    chunks.append(result.audio)

# Join the pieces and write a WAV file at the model's sample rate.
audio = mx.concatenate(chunks, axis=0)
sf.write("hello_hi.wav", np.asarray(audio), model.sample_rate)

Voices

Voice names follow the form "<Language Name> (<Gender>)":

Language          Voices
Hindi             Hindi (Male), Hindi (Female)
Bengali           Bengali (Male), Bengali (Female)
Marathi           Marathi (Male), Marathi (Female)
Telugu            Telugu (Male), Telugu (Female)
Kannada           Kannada (Male), Kannada (Female)
Tamil             Tamil (Male), Tamil (Female)
Malayalam         Malayalam (Male), Malayalam (Female)
Gujarati          Gujarati (Male), Gujarati (Female)
Punjabi           Punjabi (Male), Punjabi (Female)
Assamese          Assamese (Male), Assamese (Female)
Bhojpuri          Bhojpuri (Male), Bhojpuri (Female)
Magahi            Magahi (Male), Magahi (Female)
Maithili          Maithili (Male), Maithili (Female)
Chhattisgarhi     Chhattisgarhi (Male), Chhattisgarhi (Female)
Bodo              Bodo (Male), Bodo (Female)
Dogri             Dogri (Male), Dogri (Female)
Nepali            Nepali (Male), Nepali (Female)
Sanskrit          Sanskrit (Male), Sanskrit (Female)
English (Indian)  English (Indian) (Male), English (Indian) (Female)

38 voices across 19 languages.
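
The naming rule above is simple enough to generate and validate in code. The following is a minimal sketch; `LANGUAGES`, `GENDERS`, `voice_name`, and `ALL_VOICES` are hypothetical names introduced here for illustration and are not part of mlx-audio.

```python
# Hypothetical helper (not part of mlx-audio): builds the voice names
# listed in the table above from the "<Language Name> (<Gender>)" rule.
LANGUAGES = [
    "Hindi", "Bengali", "Marathi", "Telugu", "Kannada", "Tamil",
    "Malayalam", "Gujarati", "Punjabi", "Assamese", "Bhojpuri", "Magahi",
    "Maithili", "Chhattisgarhi", "Bodo", "Dogri", "Nepali", "Sanskrit",
    "English (Indian)",
]
GENDERS = ("Male", "Female")


def voice_name(language: str, gender: str) -> str:
    """Return "<Language Name> (<Gender>)", rejecting unknown inputs."""
    if language not in LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    if gender not in GENDERS:
        raise ValueError(f"unsupported gender: {gender!r}")
    return f"{language} ({gender})"


# Every language/gender combination, in table order.
ALL_VOICES = [voice_name(lang, g) for lang in LANGUAGES for g in GENDERS]
```

Validating the name before calling the model gives an early, readable error for a typo such as "Hindi (female)". Note that the Indian English voices carry two parenthesised parts, for example "English (Indian) (Female)".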

Sampling Recommendations

The upstream svara-tts-inference repo uses these defaults; they're a good starting point:

Parameter           Value
temperature         0.75
top_p               0.9
top_k               40
repetition_penalty  1.1
max_tokens          1200–2048
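
One way to keep these values in one place is a small defaults dictionary with per-call overrides. This is only a sketch: `SVARA_DEFAULTS` and `sampling_kwargs` are hypothetical names, and the choice of 1200 for `max_tokens` is simply the lower end of the recommended range.

```python
# Values copied from the table above; the helper itself is hypothetical.
SVARA_DEFAULTS = {
    "temperature": 0.75,
    "top_p": 0.9,
    "top_k": 40,
    "repetition_penalty": 1.1,
    "max_tokens": 1200,  # lower end of the 1200–2048 range
}


def sampling_kwargs(**overrides):
    """Merge per-call overrides into the recommended defaults."""
    unknown = set(overrides) - set(SVARA_DEFAULTS)
    if unknown:
        # Catch misspelled parameter names instead of silently ignoring them.
        raise KeyError(f"unknown sampling parameter(s): {sorted(unknown)}")
    # Build a new dict so SVARA_DEFAULTS is never mutated.
    return {**SVARA_DEFAULTS, **overrides}
```

The result can then be unpacked into the call shown under Usage, for example `model.generate(text=..., voice=..., **sampling_kwargs(max_tokens=2048))` for longer passages.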

Architecture

  • Backbone: Llama-3.2-3B fine-tuned from Canopy Labs' Orpheus Hindi base.
  • Codec: SNAC 24 kHz, 3-level hierarchical RVQ, 7 codes per frame (1 coarse + 2 medium + 4 fine); each frame decodes to 2048 samples, about 85 ms at 24 kHz.
  • Output: 24 kHz mono PCM.

Internally, mlx-audio dispatches Svara to the generic Llama TTS loader (any model whose config.json declares model_type: llama and uses the SNAC token layout works out of the box). The SNAC codec is auto-loaded from mlx-community/snac_24khz.
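
The SNAC token layout is not spelled out here. As a rough illustration, the sketch below assumes the interleaving used by the Orpheus reference decoder: each frame holds 7 codes, the code at position p within a frame is offset by p × 4096, and the positions map to the three codec levels as 1 + 2 + 4. It also assumes the tokenizer's audio-token base has already been subtracted. `split_frames` and the constants are hypothetical names, not mlx-audio API.

```python
CODEBOOK_SIZE = 4096   # assumption: per-position offset in the Orpheus layout
CODES_PER_FRAME = 7    # 7 codes per frame, as listed under Architecture


def split_frames(codes):
    """De-interleave a flat code list into 3 SNAC levels (1 + 2 + 4 per frame)."""
    n_frames = len(codes) // CODES_PER_FRAME  # drop any incomplete trailing frame
    coarse, medium, fine = [], [], []
    for f in range(n_frames):
        frame = codes[f * CODES_PER_FRAME:(f + 1) * CODES_PER_FRAME]
        # Remove the positional offset so each code falls in [0, CODEBOOK_SIZE).
        c = [code - pos * CODEBOOK_SIZE for pos, code in enumerate(frame)]
        coarse.append(c[0])                    # level 1: one code per frame
        medium.extend([c[1], c[4]])            # level 2: two codes per frame
        fine.extend([c[2], c[3], c[5], c[6]])  # level 3: four codes per frame
    return coarse, medium, fine
```

The three returned lists are what a SNAC decoder would consume, one per codebook level. A model that emits codes in a different order or with different offsets would need a different mapping.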

Voice Cloning

The shared Orpheus Llama loader exposes a ref_audio / ref_text voice-cloning path. Per the in-repo warning, it is known to be unreliable on Orpheus-family fine-tunes (including Svara) and is best avoided until upstream addresses the issue.

License

Apache 2.0. See the parent model card for full details, training data, and evaluation.