Svara TTS
Multilingual autoregressive text-to-speech for 19 Indian languages, in the Orpheus / SNAC family. Based on kenpath/svara-tts-v1 — a Llama-3.2-3B fine-tune over Canopy Labs' canopylabs/3b-hi-ft-research_release Orpheus base, paired with the SNAC 24 kHz neural codec.
Model Variants
| Model | Format | Size | HuggingFace |
|---|---|---|---|
| mlx-community/svara-tts-v1-4bit | MLX 4-bit | ~1.9 GB | Model Card |
| mlx-community/svara-tts-v1-8bit | MLX 8-bit | ~3.5 GB | Model Card |
Usage

    import mlx.core as mx
    import numpy as np
    import soundfile as sf
    from mlx_audio.tts.utils import load_model

    # Load the 4-bit MLX weights.
    model = load_model("mlx-community/svara-tts-v1-4bit")

    # generate() yields one or more results; collect the audio from each.
    chunks = []
    for result in model.generate(
        # "Hello, how are you? I am fine."
        text="नमस्ते, आप कैसे हैं? मैं ठीक हूँ।",
        voice="Hindi (Female)",
        temperature=0.75,
        top_p=0.9,
        top_k=40,
        repetition_penalty=1.1,
        max_tokens=1200,
    ):
        chunks.append(result.audio)

    # Join the pieces and write a WAV file at the model's sample rate.
    audio = mx.concatenate(chunks, axis=0)
    sf.write("hello_hi.wav", np.asarray(audio), model.sample_rate)

Voices
Voice names follow the form "<Language Name> (<Gender>)":
| Language | Voices |
|---|---|
| Hindi | Hindi (Male), Hindi (Female) |
| Bengali | Bengali (Male), Bengali (Female) |
| Marathi | Marathi (Male), Marathi (Female) |
| Telugu | Telugu (Male), Telugu (Female) |
| Kannada | Kannada (Male), Kannada (Female) |
| Tamil | Tamil (Male), Tamil (Female) |
| Malayalam | Malayalam (Male), Malayalam (Female) |
| Gujarati | Gujarati (Male), Gujarati (Female) |
| Punjabi | Punjabi (Male), Punjabi (Female) |
| Assamese | Assamese (Male), Assamese (Female) |
| Bhojpuri | Bhojpuri (Male), Bhojpuri (Female) |
| Magahi | Magahi (Male), Magahi (Female) |
| Maithili | Maithili (Male), Maithili (Female) |
| Chhattisgarhi | Chhattisgarhi (Male), Chhattisgarhi (Female) |
| Bodo | Bodo (Male), Bodo (Female) |
| Dogri | Dogri (Male), Dogri (Female) |
| Nepali | Nepali (Male), Nepali (Female) |
| Sanskrit | Sanskrit (Male), Sanskrit (Female) |
| English (Indian) | English (Indian) (Male), English (Indian) (Female) |
38 voices across 19 languages.
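
Because every voice name is built from a language and a gender, the full list can be generated rather than typed by hand. The following is a minimal sketch; `LANGUAGES`, `GENDERS`, `voice_name`, and `ALL_VOICES` are illustrative names introduced here and are not part of mlx-audio.

```python
# Illustrative helpers; these names are not part of mlx-audio.
LANGUAGES = [
    "Hindi", "Bengali", "Marathi", "Telugu", "Kannada", "Tamil",
    "Malayalam", "Gujarati", "Punjabi", "Assamese", "Bhojpuri", "Magahi",
    "Maithili", "Chhattisgarhi", "Bodo", "Dogri", "Nepali", "Sanskrit",
    "English (Indian)",
]
GENDERS = ("Male", "Female")


def voice_name(language: str, gender: str) -> str:
    """Return a voice string such as "Hindi (Female)", validating both parts."""
    if language not in LANGUAGES:
        raise ValueError(f"unknown language: {language!r}")
    if gender not in GENDERS:
        raise ValueError(f"unknown gender: {gender!r}")
    return f"{language} ({gender})"


# Every language paired with every gender, in table order.
ALL_VOICES = [voice_name(lang, g) for lang in LANGUAGES for g in GENDERS]
```

Validating the name before calling the model catches typos such as a missing space before the parenthesis, which would otherwise only surface at generation time.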
Sampling Recommendations
The upstream svara-tts-inference repo uses these defaults; they're a good starting point:
| Parameter | Value |
|---|---|
| temperature | 0.75 |
| top_p | 0.9 |
| top_k | 40 |
| repetition_penalty | 1.1 |
| max_tokens | 1200–2048 |
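
One way to avoid repeating the recommended values at every call site is to keep them in a single dictionary and override only what changes. This is a sketch only; `SVARA_SAMPLING_DEFAULTS` and `sampling_kwargs` are hypothetical helper names, not something mlx-audio provides.

```python
# Recommended starting values, kept in one place (hypothetical helper).
SVARA_SAMPLING_DEFAULTS = {
    "temperature": 0.75,
    "top_p": 0.9,
    "top_k": 40,
    "repetition_penalty": 1.1,
    "max_tokens": 1200,  # raise toward 2048 for longer passages
}


def sampling_kwargs(**overrides):
    """Return the defaults with per-call overrides applied on top."""
    unknown = set(overrides) - set(SVARA_SAMPLING_DEFAULTS)
    if unknown:
        raise TypeError(f"unknown sampling parameters: {sorted(unknown)}")
    return {**SVARA_SAMPLING_DEFAULTS, **overrides}
```

The result can be unpacked into the call from the Usage section, for example `model.generate(text=..., voice="Hindi (Female)", **sampling_kwargs(max_tokens=2048))`.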
Architecture
- Backbone: Llama-3.2-3B fine-tuned from Canopy Labs' Orpheus Hindi base.
- Codec: SNAC 24 kHz, 3-level hierarchical RVQ, 7 codes per ~85 ms frame (1 coarse, 2 mid, 4 fine).
- Output: 24 kHz mono PCM.
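
The frame structure also gives a rough ceiling on how much audio a given `max_tokens` budget can produce. The sketch below assumes that the SNAC 24 kHz codec emits its finest codes every 512 samples, so one 7-code frame (with 4 fine codes) covers 2048 samples; that figure and the helper name `max_audio_seconds` are assumptions made for illustration, not values read from mlx-audio.

```python
SAMPLE_RATE = 24_000       # Hz, SNAC 24 kHz output
CODES_PER_FRAME = 7        # 1 coarse + 2 mid + 4 fine codes
SAMPLES_PER_FRAME = 2_048  # assumption: 512-sample hop x 4 fine codes


def max_audio_seconds(max_tokens: int) -> float:
    """Upper bound on audio length if every generated token is an audio code."""
    full_frames = max_tokens // CODES_PER_FRAME  # partial frames cannot be decoded
    return full_frames * SAMPLES_PER_FRAME / SAMPLE_RATE
```

Under these assumptions a budget of 1200 tokens tops out at roughly 14.6 s of speech and 2048 tokens at roughly 24.9 s. The real figure may be slightly lower if any generated tokens are not audio codes.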
Internally, mlx-audio dispatches Svara to the generic Llama TTS loader; any model whose config.json declares model_type: llama and uses the SNAC token layout works out of the box. The SNAC codec is loaded automatically from mlx-community/snac_24khz.
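
For readers unfamiliar with the SNAC token layout, the sketch below shows how a flat stream of 7-code frames can be split back into the three codebook levels. It follows the ordering used by the Orpheus reference decoder (position 0 is coarse, positions 1 and 4 are mid, positions 2, 3, 5, and 6 are fine, each shifted by a multiple of the 4096-entry codebook size). Whether mlx-audio's loader does this in exactly the same way is an assumption, and `split_frames` is an illustrative name.

```python
CODEBOOK_SIZE = 4096  # entries per SNAC codebook


def split_frames(codes: list[int]) -> tuple[list[int], list[int], list[int]]:
    """De-interleave flat 7-code frames into coarse, mid, and fine levels.

    `codes` are audio codes with the tokenizer offset already removed, so
    position p within a frame holds a value in [p * 4096, (p + 1) * 4096).
    """
    coarse, mid, fine = [], [], []
    usable = len(codes) - len(codes) % 7  # drop a trailing partial frame
    for start in range(0, usable, 7):
        # Remove the per-position offset so every code lands in [0, 4096).
        frame = [c - p * CODEBOOK_SIZE for p, c in enumerate(codes[start:start + 7])]
        coarse.append(frame[0])
        mid.extend([frame[1], frame[4]])
        fine.extend([frame[2], frame[3], frame[5], frame[6]])
    return coarse, mid, fine
```

Each frame contributes one coarse, two mid, and four fine codes, which is where the figure of 7 codes per frame comes from.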
Voice cloning
The shared Orpheus Llama loader exposes a ref_audio / ref_text voice-cloning path. Per the in-repo warning, it is known to be unreliable on Orpheus-family fine-tunes (including Svara) and is best avoided until upstream addresses the issue.
License
Apache 2.0 — see the parent model card for full details, training data, and evaluation.
Links
- Parent model (kenpath/svara-tts-v1)
- Orpheus Hindi base (canopylabs/3b-hi-ft-research_release)
- Reference inference repo (Kenpath/svara-tts-inference)
- Llama TTS source code