# Quantization
Quantization reduces model size and can improve inference speed by storing weights at lower precision. MLX Audio supports several quantization formats through the `mlx_audio.convert` module.
## Why Quantize?
| Benefit | Details |
|---|---|
| Smaller models | A 4-bit model is roughly 4x smaller than float16 |
| Faster inference | Fewer bytes per weight to read from memory, which is often the bottleneck on Apple Silicon |
| Lower memory usage | Fit larger models into unified memory |
The trade-off is a potential reduction in audio quality, especially at very low bit widths.
## Available Bit Widths
| Bits | Typical Use Case | Quality | Size Reduction |
|---|---|---|---|
| 3-bit | Maximum compression, acceptable for quick prototyping | Lower | ~5x vs fp16 |
| 4-bit | Good balance of quality and size | Good | ~4x vs fp16 |
| 6-bit | Near-lossless for most models | Very good | ~2.5x vs fp16 |
| 8-bit | Minimal quality loss | Excellent | ~2x vs fp16 |
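The size factors above follow roughly from the bits stored per weight, plus a small overhead for per-group scales and biases. The sketch below is a rough back-of-the-envelope estimate, assuming affine quantization with one fp16 scale and bias per group of 64 weights; real checkpoints also keep some layers unquantized, so treat the numbers as ballpark figures, not exact on-disk sizes.

```python
def quantized_size_mb(num_params: float, bits: int, group_size: int = 64) -> float:
    """Rough on-disk size of a group-quantized model in MB.

    Assumes `bits` per weight plus one fp16 scale and one fp16 bias per
    group of `group_size` weights (affine mode). Ballpark only.
    """
    weight_bits = num_params * bits
    overhead_bits = num_params / group_size * 2 * 16  # fp16 scale + bias per group
    return (weight_bits + overhead_bits) / 8 / 2**20

params = 82e6  # e.g. Kokoro-82M
fp16_mb = params * 16 / 8 / 2**20
for bits in (8, 6, 4, 3):
    q_mb = quantized_size_mb(params, bits)
    print(f"{bits}-bit: ~{q_mb:.0f} MB ({fp16_mb / q_mb:.1f}x smaller than fp16)")
```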
## Quantization Modes
MLX Audio supports several quantization modes:
| Mode | Description |
|---|---|
| `affine` | Standard affine quantization (default) |
| `mxfp4` | Microscaling FP4 format |
| `mxfp8` | Microscaling FP8 format |
| `nvfp4` | NVIDIA FP4 format |
## Converting and Quantizing Models
Use the `mlx_audio.convert` CLI to convert a HuggingFace model to MLX format with quantization:
### Basic 4-bit Quantization
```bash
python -m mlx_audio.convert \
  --hf-path prince-canuma/Kokoro-82M \
  --mlx-path ./Kokoro-82M-4bit \
  --quantize \
  --q-bits 4
```
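Once conversion finishes, the output directory can be loaded like any other model. A minimal sketch, assuming `load_model` accepts a local path the same way it accepts a Hub repo id (the voice name below is also an assumption; check the model card for the voices your checkpoint ships with):

```python
from mlx_audio.tts.utils import load_model

# Load the locally converted 4-bit model from the output directory above
model = load_model("./Kokoro-82M-4bit")

# Generate speech; "af_heart" is an assumed voice name -- adjust for your model
results = list(model.generate(
    text="Hello from a 4-bit quantized Kokoro model.",
    voice="af_heart",
))
```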
### MXFP4 Quantization
```bash
python -m mlx_audio.convert \
  --hf-path prince-canuma/Kokoro-82M \
  --mlx-path ./Kokoro-82M-mxfp4 \
  --quantize \
  --q-mode mxfp4
```
### Convert to bfloat16 (No Quantization)
```bash
python -m mlx_audio.convert \
  --hf-path prince-canuma/Kokoro-82M \
  --mlx-path ./Kokoro-82M-bf16 \
  --dtype bfloat16
```
### Upload to Hugging Face Hub
Add `--upload-repo` to push the converted model directly:
```bash
python -m mlx_audio.convert \
  --hf-path prince-canuma/Kokoro-82M \
  --mlx-path ./Kokoro-82M-4bit \
  --quantize \
  --q-bits 4 \
  --upload-repo username/Kokoro-82M-4bit
```
## Conversion Options Reference
| Flag | Description |
|---|---|
| `--hf-path` | Source HuggingFace model or local path |
| `--mlx-path` | Output directory for the converted model |
| `-q, --quantize` | Enable quantization |
| `--q-bits` | Bits per weight (e.g., 3, 4, 6, 8) |
| `--q-group-size` | Group size for quantization (defaults depend on mode) |
| `--q-mode` | Quantization mode: `affine`, `mxfp4`, `mxfp8`, `nvfp4` |
| `--dtype` | Weight dtype when not quantizing: `float16`, `bfloat16`, `float32` |
| `--upload-repo` | Upload converted model to HuggingFace Hub |
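A smaller group size spends a little more space on scales and biases but gives each group its own finer-grained scale, which can help quality at low bit widths. A sketch combining the flags above (the group size of 32 and the output path are illustrative choices, not recommended defaults):

```bash
python -m mlx_audio.convert \
  --hf-path prince-canuma/Kokoro-82M \
  --mlx-path ./Kokoro-82M-4bit-gs32 \
  --quantize \
  --q-bits 4 \
  --q-group-size 32 \
  --q-mode affine
```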
## Using Pre-Quantized Models
The mlx-community organization on Hugging Face hosts many pre-quantized models ready to use. For example:
```python
from mlx_audio.tts.utils import load_model

# Load a pre-quantized model -- no conversion needed
model = load_model("mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-6bit")

results = list(model.generate(
    text="This is running from a 6-bit quantized model!",
    voice="serena",
))
```

```python
from mlx_audio.stt.utils import load

# 4-bit Voxtral for faster STT
model = load("mlx-community/Voxtral-Mini-4B-Realtime-2602-4bit")

result = model.generate("audio.wav")
print(result.text)
```
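If you want to confirm what settings a pre-quantized checkpoint was converted with, the quantization parameters are typically recorded in the model's `config.json`. The snippet below is a sketch; the exact key layout is an assumption based on the usual MLX convention, so inspect the file for your model.

```python
import json

from huggingface_hub import hf_hub_download

# Fetch just the config file for a pre-quantized checkpoint
config_path = hf_hub_download(
    "mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-6bit", "config.json"
)
with open(config_path) as f:
    config = json.load(f)

# MLX-converted models usually record their settings under a
# "quantization" key, e.g. {"group_size": 64, "bits": 6}; treat the
# exact layout as an assumption and check the file for your model.
print(config.get("quantization"))
```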
## Quality vs. Performance Trade-offs
**General guidance:** Start with 4-bit or 6-bit quantization. Move up to 8-bit only if you hear artifacts, or down to 3-bit if you need the smallest possible model.
- TTS models tend to be sensitive to quantization at 3-bit. 4-bit is usually the sweet spot.
- STT models (e.g., Whisper, Voxtral Realtime) often tolerate 4-bit quantization with minimal accuracy loss.
- Larger models (1B+ parameters) generally tolerate lower bit widths better than smaller ones.
- Always listen to output when evaluating TTS quantization -- word error rate alone does not capture prosody degradation. A quick A/B comparison is sketched after this list.
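A minimal sketch of such an A/B check, assuming you have converted both a bf16 and a 4-bit copy of the same model locally (the paths match the conversion examples above; the voice name, the `.audio` attribute on each generated segment, and the 24 kHz sample rate are assumptions, so adapt them to your model):

```python
import numpy as np
import soundfile as sf

from mlx_audio.tts.utils import load_model

TEXT = "The quick brown fox jumps over the lazy dog."

# Both paths are local converts of the same base model at different precisions
for label, path in [("bf16", "./Kokoro-82M-bf16"), ("4bit", "./Kokoro-82M-4bit")]:
    model = load_model(path)
    segments = list(model.generate(text=TEXT, voice="af_heart"))  # voice is an assumption

    # Concatenate segment audio and write one file per precision so the two
    # renditions can be compared by ear. The `.audio` attribute and 24 kHz
    # sample rate are assumptions -- check your model's result type.
    audio = np.concatenate([np.array(seg.audio) for seg in segments])
    sf.write(f"comparison_{label}.wav", audio, 24000)
```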
## Mixed Quantization Recipes
MLX Audio also supports mixed-precision quantization recipes that apply different bit widths to different layers:
| Recipe | Description |
|---|---|
| `mixed_2_6` | 2-bit for some layers, 6-bit for others |
| `mixed_3_4` | 3-bit / 4-bit mix |
| `mixed_3_6` | 3-bit / 6-bit mix |
| `mixed_4_6` | 4-bit / 6-bit mix |
These can provide better quality-to-size ratios by keeping critical layers at higher precision.
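The effective compression of a mixed recipe falls between its two bit widths and depends on how many parameters land in each bucket. A rough worked example, assuming purely for illustration that 30% of parameters stay at the higher precision (the actual recipes may use a different split):

```python
def effective_bits(low_bits: int, high_bits: int, high_fraction: float) -> float:
    """Average bits per weight for a two-level mixed recipe.

    `high_fraction` is the share of parameters kept at `high_bits`; the 0.3
    used below is an illustrative assumption, not the split the recipes use.
    """
    return (1 - high_fraction) * low_bits + high_fraction * high_bits

for name, (low, high) in {
    "mixed_2_6": (2, 6),
    "mixed_3_4": (3, 4),
    "mixed_3_6": (3, 6),
    "mixed_4_6": (4, 6),
}.items():
    bits = effective_bits(low, high, high_fraction=0.3)
    print(f"{name}: ~{bits:.1f} bits/weight, ~{16 / bits:.1f}x smaller than fp16")
```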