Streaming Audio¶
MLX Audio supports streaming audio generation for low-latency playback. Instead of waiting for the entire utterance to be synthesized, you can start playing audio chunks as they are produced.
CLI Streaming¶
Add the --stream flag to any TTS generation command. When streaming is enabled, audio is played back in real time as chunks become available:
mlx_audio.tts.generate \
--model mlx-community/Kokoro-82M-bf16 \
--text "Hello, this is a streaming example!" \
--lang_code a \
--stream
Streaming implies playback
The --stream flag automatically enables --play, so there is no need to pass both.
Controlling Chunk Size¶
The --streaming_interval argument controls how frequently audio chunks are emitted (in seconds). Smaller values reduce latency but add per-chunk overhead:
mlx_audio.tts.generate \
--model mlx-community/Kokoro-82M-bf16 \
--text "Adjusting the streaming interval changes latency." \
--lang_code a \
--stream \
--streaming_interval 1.5
The default interval is 2.0 seconds.
Python Streaming¶
TTS Streaming¶
Every model's generate() method accepts stream=True. When enabled, it yields a GenerationResult for each audio chunk as it is produced, rather than returning a single result after full synthesis:
from mlx_audio.tts.utils import load_model
model = load_model("mlx-community/Kokoro-82M-bf16")
for result in model.generate(
    text="This audio will stream chunk by chunk.",
    voice="af_heart",
    lang_code="a",
    stream=True,
    streaming_interval=2.0,
):
    # result.audio is an mx.array with one chunk of audio
    print(f"Chunk: {result.audio.shape[0]} samples")
    # Feed result.audio to an audio player or buffer
Each GenerationResult chunk contains:
| Attribute | Description |
|---|---|
| audio | mx.array waveform for this chunk |
| sample_rate | Sample rate in Hz |
| is_streaming_chunk | True for intermediate chunks |
| is_final_chunk | True for the last chunk |
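For example, the chunk flags and sample rate can drive a simple real-time playback loop. The sketch below is illustrative only: it assumes the third-party sounddevice package is installed for audio output, which is not part of MLX Audio.

import numpy as np
import sounddevice as sd

from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/Kokoro-82M-bf16")

stream = None
for result in model.generate(
    text="Streaming chunks straight to the speakers.",
    voice="af_heart",
    lang_code="a",
    stream=True,
):
    # Open the output stream once the sample rate is known
    if stream is None:
        stream = sd.OutputStream(samplerate=result.sample_rate, channels=1)
        stream.start()
    # Convert the mx.array chunk to float32 NumPy for playback
    stream.write(np.array(result.audio, dtype=np.float32))
    if result.is_final_chunk:
        break

if stream is not None:
    stream.stop()
    stream.close()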
Qwen3-TTS Streaming¶
Qwen3-TTS models support streaming across all generation methods -- generate(), generate_custom_voice(), and generate_voice_design():
from mlx_audio.tts.utils import load_model
model = load_model("mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-6bit")
audio_chunks = []
for result in model.generate(
    text="Hello, how are you today?",
    voice="serena",
    stream=True,
    streaming_interval=0.32,  # ~4 tokens at 12.5 Hz
):
    audio_chunks.append(result.audio)
    # Play or process each chunk for low-latency output
Streaming interval for Qwen3-TTS
At 12.5 Hz token rate, a streaming_interval of 0.32 seconds corresponds to roughly 4 tokens per chunk. Lower values reduce latency but increase overhead.
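To target a specific number of tokens per chunk, the interval follows directly from the token rate. The helper below is illustrative arithmetic only, not an MLX Audio API:

TOKEN_RATE_HZ = 12.5  # Qwen3-TTS token rate

def interval_for_tokens(tokens_per_chunk: int) -> float:
    # Streaming interval (in seconds) that yields roughly this many tokens per chunk
    return tokens_per_chunk / TOKEN_RATE_HZ

print(interval_for_tokens(4))  # 0.32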
STT Streaming¶
Several speech-to-text models support streaming transcription. Pass stream=True to generate():
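A minimal Python sketch, assuming an stt.utils.load_model helper that mirrors the TTS one, that generate() accepts a path to an audio file, and that each streamed result carries a text attribute -- verify these against your installed version:

from mlx_audio.stt.utils import load_model  # assumed STT counterpart to tts.utils

model = load_model("mlx-community/whisper-large-v3-turbo-asr-fp16")

# Assumption: generate() accepts a path to the audio file to transcribe
for result in model.generate("speech.wav", stream=True):
    # Assumption: each streamed result exposes the transcribed text for its segment
    print(result.text, end="", flush=True)
print()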
STT Streaming via CLI¶
python -m mlx_audio.stt.generate \
--model mlx-community/whisper-large-v3-turbo-asr-fp16 \
--audio speech.wav \
--output-path output \
--format json \
--stream
API Server Streaming¶
The API server supports streaming for both TTS and STT. Set "stream": true in your request:
Streaming TTS¶
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Kokoro-82M-bf16",
"input": "Streaming over HTTP!",
"voice": "af_heart",
"stream": true,
"streaming_interval": 2.0,
"response_format": "wav"
}' \
--output streamed_speech.wav
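The curl command above simply saves the stream to a file. To benefit from the low latency, consume the response incrementally instead; the sketch below uses the third-party requests library (an assumption, not bundled with MLX Audio):

import requests

payload = {
    "model": "mlx-community/Kokoro-82M-bf16",
    "input": "Streaming over HTTP!",
    "voice": "af_heart",
    "stream": True,
    "streaming_interval": 2.0,
    "response_format": "wav",
}

# stream=True tells requests not to buffer the whole response body
with requests.post(
    "http://localhost:8000/v1/audio/speech", json=payload, stream=True
) as response:
    response.raise_for_status()
    with open("streamed_speech.wav", "wb") as f:
        for chunk in response.iter_content(chunk_size=4096):
            # Each chunk arrives as soon as the server emits it; here it is
            # appended to a file, but it could be fed to an audio player instead
            f.write(chunk)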
Streaming STT¶
curl -X POST http://localhost:8000/v1/audio/transcriptions \
-F "file=@audio.wav" \
-F "model=mlx-community/whisper-large-v3-turbo-asr-fp16" \
-F "stream=true"
Streaming STT returns newline-delimited JSON (application/x-ndjson), with each line containing a text field and optional timing information.
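A sketch of consuming the NDJSON stream from Python, again using the requests library (assumed installed); any field beyond text is illustrative:

import json
import requests

with open("audio.wav", "rb") as audio_file:
    response = requests.post(
        "http://localhost:8000/v1/audio/transcriptions",
        files={"file": audio_file},
        data={
            "model": "mlx-community/whisper-large-v3-turbo-asr-fp16",
            "stream": "true",
        },
        stream=True,
    )
    response.raise_for_status()
    # Each non-empty line of the response body is a standalone JSON object
    for line in response.iter_lines():
        if line:
            event = json.loads(line)
            print(event["text"])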
Real-Time WebSocket Transcription¶
The API server also exposes a WebSocket endpoint for real-time audio transcription. This is useful for live microphone input or continuous audio streams.
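The exact WebSocket path and message protocol are not shown here, so the sketch below is purely illustrative: it uses the third-party websockets package and a placeholder URL, and assumes the server accepts raw audio bytes and replies with JSON events containing a text field. Check the server documentation for the real endpoint and protocol.

import asyncio
import json

import websockets  # third-party package, not part of MLX Audio

# Placeholder URL -- substitute the actual WebSocket endpoint exposed by the server
WS_URL = "ws://localhost:8000/v1/audio/transcriptions/ws"

async def transcribe_live(audio_chunks):
    async with websockets.connect(WS_URL) as ws:
        for chunk in audio_chunks:
            # Assumption: the server accepts raw audio bytes per message
            await ws.send(chunk)
            reply = await ws.recv()
            # Assumption: replies are JSON objects with a text field
            print(json.loads(reply).get("text", ""))

# asyncio.run(transcribe_live(my_chunk_iterator))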
Tips¶
- Latency vs. quality -- Shorter streaming_interval values give lower latency but may produce more chunk boundaries. Start with the default and decrease as needed.
- Memory -- Streaming does not significantly change peak memory usage, since model weights stay loaded either way; it only changes when decoded audio is returned to the caller.
- Joining chunks -- If you need a single contiguous file, concatenate the result.audio arrays after the loop and write once with mlx_audio.audio_io.write(), as in the sketch below.
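For the joining tip, a minimal sketch that collects the chunks from the Kokoro example and writes a single file. It follows the tip's mlx_audio.audio_io.write() suggestion; the (path, audio, sample_rate) argument order is an assumption, so verify it against your installed version:

import mlx.core as mx

from mlx_audio import audio_io
from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/Kokoro-82M-bf16")

chunks = []
sample_rate = None
for result in model.generate(
    text="Collect every chunk, then write a single file.",
    voice="af_heart",
    lang_code="a",
    stream=True,
):
    chunks.append(result.audio)
    sample_rate = result.sample_rate

# One concatenation and one write after the loop
audio = mx.concatenate(chunks, axis=0)
audio_io.write("full_utterance.wav", audio, sample_rate)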