
Quick Start: CLI

mlx-audio provides command-line tools for both text-to-speech generation and speech-to-text transcription.

Text-to-Speech

Note

These TTS quickstart examples use mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit.

Basic Generation

Generate speech from text with a single command:

mlx_audio.tts.generate \
    --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit \
    --text "Hello, world!" \
    --voice Chelsie \
    --lang_code English

Play Audio Immediately

Add --play to hear the result immediately after generation:

mlx_audio.tts.generate \
    --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit \
    --text "Hello, world!" \
    --voice Chelsie \
    --lang_code English \
    --play

Voice and Language Selection

Choose a voice preset and provide a language hint:

mlx_audio.tts.generate \
    --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit \
    --text "Welcome to MLX-Audio!" \
    --voice Ethan \
    --lang_code English

Save to a Directory

Use --output_path to write the generated audio files to a directory of your choice:

mlx_audio.tts.generate \
    --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit \
    --text "Hello!" \
    --voice Chelsie \
    --lang_code English \
    --output_path ./my_audio

Voice Cloning (CSM)

Clone a voice from a reference audio file:

mlx_audio.tts.generate \
    --model mlx-community/csm-1b \
    --text "Hello from Sesame." \
    --ref_audio ./reference_voice.wav \
    --play

Speech-to-Text

Transcribe with Whisper

python -m mlx_audio.stt.generate \
    --model mlx-community/whisper-large-v3-turbo-asr-fp16 \
    --audio audio.wav \
    --output-path output \
    --format json \
    --verbose
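
To transcribe a whole folder of recordings, the command above can be wrapped in a loop. The following is a minimal sketch, not part of mlx-audio: the function name transcribe_dir, the MLX_STT variable, and the output layout are hypothetical. It also assumes that --output-path accepts a per-file output name; adjust this if your version expects something else.

```shell
#!/bin/sh
# Command used for transcription. Override it (e.g. MLX_STT=echo) for a dry run.
MLX_STT="${MLX_STT:-python -m mlx_audio.stt.generate}"

# transcribe_dir IN_DIR OUT_DIR
# Runs the Whisper example once for every .wav file in IN_DIR.
transcribe_dir() {
    in_dir="$1"
    out_dir="$2"
    mkdir -p "$out_dir"
    for audio in "$in_dir"/*.wav; do
        # Skip the literal pattern when IN_DIR contains no .wav files.
        [ -e "$audio" ] || continue
        name=$(basename "$audio" .wav)
        # Assumption: --output-path takes an output name for this file.
        $MLX_STT \
            --model mlx-community/whisper-large-v3-turbo-asr-fp16 \
            --audio "$audio" \
            --output-path "$out_dir/$name" \
            --format json
    done
}
```

After sourcing the function, a call such as transcribe_dir ./recordings ./transcripts processes each file in turn. Setting MLX_STT=echo first prints the commands instead of running them.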

Transcribe with Parakeet

python -m mlx_audio.stt.generate \
    --model mlx-community/parakeet-tdt-0.6b-v3 \
    --audio speech.wav \
    --output-path output \
    --format json \
    --verbose

Transcribe with VibeVoice-ASR

python -m mlx_audio.stt.generate \
    --model mlx-community/VibeVoice-ASR-bf16 \
    --audio meeting.wav \
    --output-path output \
    --format json \
    --max-tokens 8192 \
    --verbose

Pass context or hotwords with --context to improve accuracy on domain-specific terms:

python -m mlx_audio.stt.generate \
    --model mlx-community/VibeVoice-ASR-bf16 \
    --audio technical_talk.wav \
    --output-path output \
    --format json \
    --max-tokens 8192 \
    --context "MLX, Apple Silicon, PyTorch, Transformer" \
    --verbose
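
When the list of domain terms is long, it may be easier to keep it in a text file and build the --context string from it. The sketch below assumes a hypothetical terms file with one term per line and produces the same comma-separated form used in the example above. The helper names join_terms and transcribe_with_terms and the MLX_STT variable are illustrative, not part of mlx-audio.

```shell
#!/bin/sh
# Command used for transcription. Override it (e.g. MLX_STT=echo) for a dry run.
MLX_STT="${MLX_STT:-python -m mlx_audio.stt.generate}"

# join_terms FILE
# Joins the non-blank lines of FILE into "term1, term2, term3".
join_terms() {
    grep -v '^[[:space:]]*$' "$1" | paste -s -d ',' - | sed 's/,/, /g'
}

# transcribe_with_terms AUDIO TERMS_FILE
# Runs the VibeVoice-ASR example with --context built from TERMS_FILE.
transcribe_with_terms() {
    audio="$1"
    terms_file="$2"
    $MLX_STT \
        --model mlx-community/VibeVoice-ASR-bf16 \
        --audio "$audio" \
        --output-path output \
        --format json \
        --max-tokens 8192 \
        --context "$(join_terms "$terms_file")"
}
```

For example, transcribe_with_terms technical_talk.wav terms.txt reads the terms from terms.txt and passes them as a single --context argument.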

Common Flags

Flag            Description
--model         Hugging Face model ID or local path
--text          Input text for TTS generation
--audio         Input audio file for STT transcription
--voice         Voice preset name (e.g., Chelsie, Ethan, casual_male)
--speed         Speech speed multiplier (default: 1.0)
--lang_code     Language hint (e.g., English, Chinese, or auto)
--play          Play audio immediately after generation
--output_path   Directory to save output files (TTS); the STT examples above spell this flag --output-path
--verbose       Show detailed generation info
--format        Output format for STT (e.g., json)
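
The --speed flag is not used in the examples above. As a rough sketch, the loop below renders the same sentence at several speeds using only flags from this table. The function name speak_at_speeds, the MLX_TTS variable, and the one-directory-per-speed layout are assumptions for illustration, and whether a given model honours --speed depends on the model.

```shell
#!/bin/sh
# Command used for generation. Override it (e.g. MLX_TTS=echo) for a dry run.
MLX_TTS="${MLX_TTS:-mlx_audio.tts.generate}"

# speak_at_speeds TEXT OUT_DIR SPEED...
# Generates TEXT once per SPEED, saving each result under OUT_DIR/speed_<SPEED>.
speak_at_speeds() {
    text="$1"
    out_dir="$2"
    shift 2
    for speed in "$@"; do
        # A separate directory per speed keeps the outputs from colliding.
        mkdir -p "$out_dir/speed_$speed"
        $MLX_TTS \
            --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit \
            --text "$text" \
            --voice Chelsie \
            --lang_code English \
            --speed "$speed" \
            --output_path "$out_dir/speed_$speed"
    done
}
```

A call such as speak_at_speeds "Hello, world!" ./my_audio 0.9 1.0 1.2 then produces one output directory per speed for side-by-side listening.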

Note

Different models support different flags. Check the Models section for model-specific options.