Quick Start: Python
Use the mlx-audio Python API to generate speech, transcribe audio, and process audio programmatically.
Text-to-Speech
Note
These TTS quickstart examples use mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit.
Basic Generation
from mlx_audio.tts.utils import load_model
# Load a TTS model
model = load_model("mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit")
# Generate speech
for result in model.generate(
    "Hello from MLX-Audio!",
    voice="Chelsie",
    lang_code="English",
):
    print(f"Generated {result.audio.shape[0]} samples")
    # result.audio contains the waveform as mx.array
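The sample count printed above converts to a duration once the sample rate is known. The following is a minimal sketch, assuming the 24000 Hz rate used in the Saving Audio example on this page also applies to this model's output. `duration_seconds` is a hypothetical helper written for illustration, not part of mlx-audio.

```python
def duration_seconds(num_samples: int, sample_rate: int = 24000) -> float:
    """Convert a sample count to seconds (hypothetical helper, not an mlx-audio API)."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return num_samples / sample_rate


# Example: 48000 samples at the assumed 24 kHz rate
print(f"{duration_seconds(48000):.2f} s")
```

Inside the generation loop this could be called as `duration_seconds(result.audio.shape[0])`; pass the real sample rate explicitly if the model uses a different one.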
Voice and Language Options
from mlx_audio.tts.utils import load_model
model = load_model("mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit")
for result in model.generate(
    text="Welcome to MLX-Audio!",
    voice="Ethan",
    lang_code="English",
):
    audio = result.audio
Qwen3-TTS Base Model
Pass voice to pick a preset speaker and lang_code to hint the language, for example voice="Chelsie" and lang_code="English".
Voxtral TTS
from mlx_audio.tts.utils import load
model = load("mlx-community/Voxtral-4B-TTS-2603-mlx-bf16")
for result in model.generate(text="Hello, how are you today?", voice="casual_male"):
    print(result.audio_duration)
Qwen3-TTS
from mlx_audio.tts.utils import load_model
model = load_model("mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit")
results = list(model.generate(
    text="Hello, welcome to MLX-Audio!",
    voice="Chelsie",
    lang_code="English",
))
audio = results[0].audio  # mx.array
Speech-to-Text
Whisper
from mlx_audio.stt.generate import generate_transcription
result = generate_transcription(
    model="mlx-community/whisper-large-v3-turbo-asr-fp16",
    audio="audio.wav",
)
print(result.text)
Qwen3-ASR
from mlx_audio.stt import load
# Speech recognition
model = load("mlx-community/Qwen3-ASR-0.6B-8bit")
result = model.generate("audio.wav", language="English")
print(result.text)
Word-Level Alignment (Qwen3-ForcedAligner)
from mlx_audio.stt import load
aligner = load("mlx-community/Qwen3-ForcedAligner-0.6B-8bit")
result = aligner.generate("audio.wav", text="I have a dream", language="English")
for item in result:
    print(f"[{item.start_time:.2f}s - {item.end_time:.2f}s] {item.text}")
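Alignment output is often more useful stored than printed. The sketch below serializes aligned items to JSON using only the `text`, `start_time`, and `end_time` attributes shown above. `AlignedItem` is a hypothetical stand-in for the aligner's item type, `alignment_to_json` is an illustrative helper, and the timings are made up.

```python
import json
from dataclasses import dataclass


@dataclass
class AlignedItem:
    """Hypothetical stand-in for one aligner result item."""
    text: str
    start_time: float
    end_time: float


def alignment_to_json(items) -> str:
    """Serialize items exposing text/start_time/end_time to a JSON string."""
    rows = [
        {
            "text": it.text,
            "start": round(it.start_time, 3),
            "end": round(it.end_time, 3),
        }
        for it in items
    ]
    return json.dumps(rows, ensure_ascii=False, indent=2)


# Made-up timings, for illustration only
demo = [AlignedItem("I", 0.00, 0.12), AlignedItem("have", 0.12, 0.40)]
print(alignment_to_json(demo))
```

Because the helper relies only on attribute access, it should accept any iterable whose items expose those three attributes.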
Parakeet (with Timestamps)
from mlx_audio.stt.utils import load
model = load("mlx-community/parakeet-tdt-0.6b-v3")
result = model.generate("audio.wav")
print(f"Text: {result.text}")
# Sentence-level timestamps
for sentence in result.sentences:
    print(f"[{sentence.start:.2f}s - {sentence.end:.2f}s] {sentence.text}")
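Timestamps like these map naturally onto subtitle files. As a rough sketch, the code below turns items with `start`, `end`, and `text` attributes into SubRip (SRT) text. `Sentence` is a hypothetical stand-in for the entries of `result.sentences`, and `srt_timestamp` and `to_srt` are illustrative helpers, not mlx-audio functions.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    """Hypothetical stand-in for an item of result.sentences."""
    text: str
    start: float
    end: float


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(sentences) -> str:
    """Build SRT text from objects exposing start, end, and text."""
    blocks = []
    for index, s in enumerate(sentences, start=1):
        span = f"{srt_timestamp(s.start)} --> {srt_timestamp(s.end)}"
        blocks.append(f"{index}\n{span}\n{s.text.strip()}\n")
    # Entries are separated by a blank line
    return "\n".join(blocks)


# Made-up sentences, for illustration only
demo_sentences = [Sentence("Hello.", 0.0, 1.25), Sentence("Welcome back.", 1.5, 3.0)]
print(to_srt(demo_sentences))
```

With a real transcription, `to_srt(result.sentences)` could be written to a `.srt` file, assuming the sentence objects expose the same three attributes as in the loop above.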
Streaming Transcription
Several STT models support streaming for low-latency output.
Saving Audio
Save with STS Utilities
from mlx_audio.sts import save_audio
# Save an audio array to a WAV file
save_audio(audio_array, "output.wav", sample_rate=24000)
Note
Saving to MP3, FLAC, OGG, Opus, or Vorbis requires ffmpeg. WAV works without it.
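For simple cases, a WAV file can also be written with Python's standard-library `wave` module alone. The following is a minimal sketch, assuming mono audio, float samples in the range [-1.0, 1.0] that have already been converted to a plain Python sequence, and the 24000 Hz rate from the example above. `write_wav_pcm16` is a hypothetical helper, not part of mlx-audio.

```python
import math
import os
import struct
import tempfile
import wave


def write_wav_pcm16(path, samples, sample_rate=24000):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for x in samples:
        # Clip to the valid range, then scale to signed 16-bit integers
        clipped = max(-1.0, min(1.0, float(x)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(str(path), "wb") as wf:
        wf.setnchannels(1)        # mono
        wf.setsampwidth(2)        # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(frames))


# Demo: 0.1 s of a 440 Hz tone, written to the system temp directory
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 24000) for n in range(2400)]
out_path = os.path.join(tempfile.gettempdir(), "tone.wav")
write_wav_pcm16(out_path, tone)
```

The per-sample loop favors clarity over speed; for long recordings a vectorized conversion would be the more practical choice.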