Quick Start: CLI¶
mlx-audio provides command-line tools for both text-to-speech generation and speech-to-text transcription.
Text-to-Speech¶
Note
These TTS quickstart examples use mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit.
Basic Generation¶
Generate speech from text with a single command:
mlx_audio.tts.generate \
--model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit \
--text "Hello, world!" \
--voice Chelsie \
--lang_code English
Play Audio Immediately¶
Add --play to hear the result without saving:
mlx_audio.tts.generate \
--model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit \
--text "Hello, world!" \
--voice Chelsie \
--lang_code English \
--play
Voice and Language Selection¶
Choose a voice preset and provide a language hint:
mlx_audio.tts.generate \
--model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit \
--text "Welcome to MLX-Audio!" \
--voice Ethan \
--lang_code English
Save to a Directory¶
mlx_audio.tts.generate \
--model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit \
--text "Hello!" \
--voice Chelsie \
--lang_code English \
--output_path ./my_audio
Voice Cloning (CSM)¶
Clone a voice from a reference audio file:
mlx_audio.tts.generate \
--model mlx-community/csm-1b \
--text "Hello from Sesame." \
--ref_audio ./reference_voice.wav \
--play
Speech-to-Text¶
Transcribe with Whisper¶
python -m mlx_audio.stt.generate \
--model mlx-community/whisper-large-v3-turbo-asr-fp16 \
--audio audio.wav \
--output-path output \
--format json \
--verbose
Transcribe with Parakeet¶
python -m mlx_audio.stt.generate \
--model mlx-community/parakeet-tdt-0.6b-v3 \
--audio speech.wav \
--output-path output \
--format json \
--verbose
Transcribe with VibeVoice-ASR¶
python -m mlx_audio.stt.generate \
--model mlx-community/VibeVoice-ASR-bf16 \
--audio meeting.wav \
--output-path output \
--format json \
--max-tokens 8192 \
--verbose
Add context/hotwords for better accuracy on domain-specific terms:
python -m mlx_audio.stt.generate \
--model mlx-community/VibeVoice-ASR-bf16 \
--audio technical_talk.wav \
--output-path output \
--format json \
--max-tokens 8192 \
--context "MLX, Apple Silicon, PyTorch, Transformer" \
--verbose
Common Flags¶
| Flag | Description |
|---|---|
--model |
Hugging Face model ID or local path |
--text |
Input text for TTS generation |
--audio |
Input audio file for STT transcription |
--voice |
Voice preset name (e.g., Chelsie, Ethan, casual_male) |
--speed |
Speech speed multiplier (default: 1.0) |
--lang_code |
Language hint (e.g., English, Chinese, or auto) |
--play |
Play audio immediately after generation |
--output_path |
Directory to save output files |
--verbose |
Show detailed generation info |
--format |
Output format for STT (json, etc.) |
Note
Different models support different flags. Check the Models section for model-specific options.