Architecture¶
This page provides a high-level overview of the MLX Audio codebase, its module structure, and the key design patterns.
Module Layout¶
```
mlx_audio/
├── tts/ # Text-to-Speech
│ ├── models/ # Model implementations
│ │ ├── base.py # BaseModelArgs, GenerationResult, BatchGenerationResult
│ │ ├── kokoro/ # Kokoro TTS
│ │ ├── qwen3_tts/ # Qwen3-TTS (Base, CustomVoice, VoiceDesign)
│ │ ├── sesame/ # CSM (Conversational Speech Model)
│ │ ├── dia/ # Dia TTS
│ │ ├── chatterbox/ # Chatterbox TTS
│ │ ├── spark/ # Spark TTS
│ │ ├── outetts/ # OuteTTS
│ │ ├── soprano/ # Soprano TTS
│ │ ├── bailingmm/ # Ming Omni TTS (BailingMM)
│ │ ├── dense/ # Ming Omni TTS (Dense)
│ │ ├── voxtral_tts/ # Voxtral TTS
│ │ └── ... # Other model implementations
│ ├── utils.py # load(), load_model(), model discovery
│ ├── generate.py # generate_audio() CLI entry point
│ └── audio_player.py # Real-time audio playback
│
├── stt/ # Speech-to-Text
│ ├── models/ # Model implementations
│ │ ├── whisper/ # OpenAI Whisper
│ │ ├── parakeet/ # NVIDIA Parakeet
│ │ ├── voxtral/ # Voxtral STT
│ │ ├── voxtral_realtime/ # Voxtral Realtime
│ │ ├── qwen3_asr/ # Qwen3-ASR & ForcedAligner
│ │ ├── vibevoice_asr/ # VibeVoice-ASR
│ │ ├── canary/ # NVIDIA Canary
│ │ ├── moonshine/ # Moonshine
│ │ ├── mms/ # Meta MMS
│ │ └── ...
│ ├── utils.py # load(), load_model(), load_audio()
│ └── generate.py # CLI for transcription
│
├── sts/ # Speech-to-Speech
│ ├── models/ # SAM-Audio, MossFormer2, DeepFilterNet, Liquid
│ ├── generate.py
│ └── voice_pipeline.py
│
├── vad/ # Voice Activity Detection
│ ├── models/ # Sortformer v1 & v2.1
│ └── utils.py
│
├── audio_io.py # Audio file reading/writing (miniaudio + ffmpeg)
├── server.py # FastAPI API server
├── convert.py # Model conversion and quantization
└── ui/ # Next.js Studio UI
```
Core Design Patterns¶
Model Loading Pipeline¶
All models (TTS, STT, STS) follow the same loading pattern:
```
User calls load("mlx-community/model-name")
│
▼
Download from HuggingFace (or resolve local path)
│
▼
Read config.json → determine model_type
│
▼
Look up model_type in MODEL_REMAPPING dict
│
▼
Dynamically import mlx_audio.{category}.models.{model_type}
│
▼
Instantiate ModelConfig from config dict
│
▼
Instantiate Model(config) as mlx.nn.Module
│
▼
Load .safetensors weights into the model
│
▼
Return initialized model
```
The shared loading logic lives in mlx_audio/utils.py (base_load_model, get_model_class, get_model_path, load_config). Each category (tts, stt) has a thin wrapper in its own utils.py that passes the category-specific MODEL_REMAPPING.
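Condensed into a sketch, the flow looks roughly like this. This is illustrative only: the real base_load_model handles more cases, and load_sketch is a hypothetical name.

```python
# Illustrative condensation of the loading pipeline above; the real logic
# lives in mlx_audio/utils.py (base_load_model and friends) and handles
# more cases. `load_sketch` is a hypothetical name.
import importlib
import json
from pathlib import Path

import mlx.core as mx

def load_sketch(model_path: Path, model_remapping: dict, category: str = "tts"):
    config = json.loads((model_path / "config.json").read_text())
    model_type = model_remapping.get(config["model_type"], config["model_type"])
    module = importlib.import_module(f"mlx_audio.{category}.models.{model_type}")
    model = module.Model(module.ModelConfig.from_dict(config))
    weights = {}
    for shard in model_path.glob("*.safetensors"):
        weights.update(mx.load(str(shard)))
    model.load_weights(list(weights.items()))
    return model
```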
Generation Pattern (TTS)¶
TTS models implement generate() as a generator function that yields GenerationResult objects:
```python
for result in model.generate(text="Hello!", voice="af_heart"):
    # result.audio is an mx.array waveform
    # result.sample_rate, result.audio_duration, etc.
    ...
```
This design supports both:
- Non-streaming: the generator yields one GenerationResult per text segment.
- Streaming: with stream=True, the generator yields multiple smaller chunks with is_streaming_chunk=True and a final chunk with is_final_chunk=True (see the sketch below).
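A rough sketch of consuming the streaming mode, using the flags documented above; collecting chunks into a list stands in for real-time playback:

```python
# Hedged sketch of consuming a streaming generation; collecting chunks
# stands in for real-time playback.
import mlx.core as mx

chunks = []
for chunk in model.generate(text="Hello!", voice="af_heart", stream=True):
    chunks.append(chunk.audio)        # small waveform piece (mx.array)
    if chunk.is_final_chunk:
        break                         # the generator signals completion
waveform = mx.concatenate(chunks)     # assemble the full utterance
```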
The generate_audio() function in mlx_audio/tts/generate.py wraps this pattern with CLI argument handling, audio file writing, and optional playback.
Generation Pattern (STT)¶
STT models expose a generate() method that returns a result object with a .text attribute, as sketched below. Some models support streaming, where generate() returns an iterator of chunks instead.
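A minimal usage sketch of that pattern. The repo id is a placeholder, and whether generate() takes a file path or a decoded waveform varies by model:

```python
# Hedged STT usage sketch; load() comes from mlx_audio/stt/utils.py as
# documented above, but the repo id and the generate() argument form
# are placeholders/assumptions.
from mlx_audio.stt.utils import load

model = load("mlx-community/some-stt-model")   # placeholder repo id
result = model.generate("speech.wav")          # path or waveform, model-dependent
print(result.text)
```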
Base Classes¶
BaseModelArgs (mlx_audio/tts/models/base.py)
A dataclass base with a from_dict() class method that filters unknown keys. This allows config.json files to contain extra fields without causing errors.
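A common shape for this pattern looks like the following (an illustrative reconstruction, not a verbatim copy of base.py):

```python
# Illustrative reconstruction of the key-filtering idea; the real
# BaseModelArgs lives in mlx_audio/tts/models/base.py.
import inspect
from dataclasses import dataclass

@dataclass
class BaseModelArgs:
    @classmethod
    def from_dict(cls, params: dict):
        # Keep only keys matching the dataclass's constructor parameters,
        # so extra fields in config.json are silently ignored.
        return cls(**{
            k: v for k, v in params.items()
            if k in inspect.signature(cls).parameters
        })
```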
GenerationResult
Returned by TTS generate(). Key fields:
- audio -- mx.array waveform
- sample_rate -- int, Hz
- token_count -- number of tokens generated
- audio_duration -- human-readable string
- real_time_factor -- speed relative to real time
- processing_time_seconds -- wall-clock time
- peak_memory_usage -- GB
- is_streaming_chunk / is_final_chunk -- streaming state
BatchGenerationResult
Similar to GenerationResult but includes sequence_idx for batched generation.
Audio I/O¶
The mlx_audio/audio_io.py module provides read() and write() functions:
- Reading: Uses miniaudio for WAV/MP3/FLAC; falls back to ffmpeg for M4A/AAC/OGG/Opus.
- Writing: Uses miniaudio for WAV; uses ffmpeg for all other formats.
- Handles format detection from file extension or magic bytes.
- Provides sf_read() and sf_write() as drop-in replacements for the soundfile API.
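For instance, the soundfile-style wrappers can be used like this (a hedged sketch, assuming the same (data, samplerate) conventions as soundfile):

```python
# Hedged sketch of the documented soundfile-style wrappers; assumes the
# soundfile conventions: sf_read -> (data, samplerate).
from mlx_audio.audio_io import sf_read, sf_write

data, samplerate = sf_read("input.m4a")    # decoded via miniaudio or ffmpeg
sf_write("output.wav", data, samplerate)   # WAV output goes through miniaudio
```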
API Server¶
mlx_audio/server.py is a FastAPI application that exposes:
- POST /v1/audio/speech -- OpenAI-compatible TTS
- POST /v1/audio/transcriptions -- OpenAI-compatible STT
- POST /v1/audio/separations -- Audio source separation
- GET /v1/models -- List loaded models
- POST /v1/models -- Load a model
- DELETE /v1/models -- Unload a model
- WS /v1/audio/transcriptions/realtime -- WebSocket real-time transcription
The ModelProvider class manages loaded models with lazy loading and an async lock for thread safety.
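A hedged client example against the TTS route, assuming a local server on port 8000; the port, repo id, and exact request schema are assumptions based on the documented OpenAI compatibility:

```python
# Hedged client sketch; field names follow the OpenAI /v1/audio/speech
# schema, and the host/port and repo id are assumptions.
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:8000/v1/audio/speech",
    data=json.dumps({
        "model": "mlx-community/Kokoro-82M",   # placeholder repo id
        "input": "Hello from MLX Audio!",
        "voice": "af_heart",
    }).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    with open("speech.wav", "wb") as f:
        f.write(resp.read())
```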
Model Conversion¶
mlx_audio/convert.py handles:
- Downloading weights from HuggingFace
- Converting PyTorch weights to MLX .safetensors
- Applying quantization (affine, mxfp4, mxfp8, nvfp4)
- Uploading converted models back to HuggingFace Hub
Category-Specific Notes¶
TTS (mlx_audio/tts/)¶
- Models range from small (Kokoro at 82M parameters) to large (Ming Omni at 16.8B).
- Some models have multiple generation methods (e.g., Qwen3-TTS: generate(), generate_custom_voice(), generate_voice_design()).
- Voice cloning is handled through ref_audio and ref_text parameters passed to generate(); see the sketch after this list.
- The AudioPlayer class in audio_player.py handles real-time audio playback during streaming.
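A hedged voice-cloning sketch using the documented parameters; whether ref_audio is a path or a waveform, and which models accept it, is model-dependent:

```python
# Hedged voice-cloning sketch; ref_audio/ref_text are documented above,
# but their exact types vary by model.
for result in model.generate(
    text="This sentence is spoken in the cloned voice.",
    ref_audio="reference.wav",                  # short clip of the target speaker
    ref_text="Transcript of the reference clip.",
):
    audio = result.audio                        # cloned-voice waveform (mx.array)
```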
STT (mlx_audio/stt/)¶
- Models return structured results with .text, and optionally .segments, .sentences, or timing information.
- The CLI (generate.py) supports multiple output formats: txt, srt, vtt, json.
STS (mlx_audio/sts/)¶
- Includes source separation (SAM-Audio), speech enhancement (MossFormer2, DeepFilterNet), and speech-to-speech (Liquid).
- The voice_pipeline.py module provides higher-level voice processing workflows.
VAD (mlx_audio/vad/)¶
- Contains Sortformer models for speaker diarization.
- Supports both offline and streaming diarization modes.