# Web UI & API Server
MLX Audio ships with a FastAPI-based API server and a Next.js web interface (Studio UI) for interactive audio generation and transcription.
## Starting the Server

### API Server Only
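Launch the API server from the command line (defaults shown explicitly; see the options table below):

```bash
mlx_audio.server --host localhost --port 8000
```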
### API Server with Studio UI
Pass `--start-ui` to launch the Next.js web interface alongside the API server:
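```bash
mlx_audio.server --start-ui
```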
The API will be available at http://localhost:8000 and the Studio UI at http://localhost:3000.
## Server Options
| Flag | Default | Description |
|---|---|---|
| `--host` | `localhost` | Host to bind the server to |
| `--port` | `8000` | Port for the API server |
| `--reload` | `false` | Auto-reload on code changes (development) |
| `--start-ui` | `false` | Start the Studio UI alongside the API |
| `--allowed-origins` | `*` | CORS allowed origins (space-separated) |
| `--log-dir` | `logs` | Directory for server logs |
| `--realtime-model` | `null` | Default model for `/v1/realtime` when the client omits `?model=` |
| `--realtime-transcription-delay-ms` | `null` | Transcription latency/quality knob for models that support it (e.g. `voxtral_realtime`) |
| `--tts-max-batch-size` | `8` | Maximum compatible TTS speech requests per continuous batch session |
The two realtime flags also read from `MLX_AUDIO_REALTIME_MODEL` and `MLX_AUDIO_REALTIME_TRANSCRIPTION_DELAY_MS` if present; the CLI flags take precedence. The TTS batching flag likewise reads from `MLX_AUDIO_TTS_MAX_BATCH_SIZE`, with the CLI flag taking precedence.
## CORS Configuration
By default, all origins are allowed. To restrict origins:
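```bash
mlx_audio.server --allowed-origins http://localhost:3000 https://example.com
```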
Or set the `MLX_AUDIO_ALLOWED_ORIGINS` environment variable with a comma-separated list:
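```bash
export MLX_AUDIO_ALLOWED_ORIGINS="http://localhost:3000,https://example.com"
```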
## TTS Continuous Batching
Compatible non-streaming TTS `/v1/audio/speech` requests are routed through the server's continuous batching path when the model exposes a `create_tts_batch_session(...)` hook. Requests that the model's continuous batching hook does not support fall back to the existing serial or fixed-window batch paths.
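The per-session cap on batched requests is controlled by the flag or its environment variable:

```bash
mlx_audio.server --tts-max-batch-size 4
# equivalently:
MLX_AUDIO_TTS_MAX_BATCH_SIZE=4 mlx_audio.server
```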
## OpenAI-Compatible API
The server implements OpenAI-compatible endpoints, so existing code that targets the OpenAI audio API can point to MLX Audio with minimal changes.
### Text-to-Speech

`POST /v1/audio/speech`
```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Kokoro-82M-bf16",
    "input": "Hello, world!",
    "voice": "af_heart",
    "response_format": "mp3"
  }' \
  --output speech.mp3
```
Request body:
| Field | Type | Default | Description |
|---|---|---|---|
| `model` | string | required | Model ID (HuggingFace repo or local path) |
| `input` | string | required | Text to synthesize |
| `voice` | string | `null` | Voice preset name |
| `speed` | float | `1.0` | Playback speed multiplier |
| `lang_code` | string | `"a"` | Language code |
| `response_format` | string | `"mp3"` | Output format: `mp3`, `wav`, `flac`, `ogg`, `opus` |
| `stream` | bool | `false` | Stream audio chunks |
| `streaming_interval` | float | `2.0` | Seconds between stream chunks |
| `temperature` | float | `0.7` | Sampling temperature |
| `max_tokens` | int | `1200` | Maximum generation tokens |
| `ref_audio` | string | `null` | Path to reference audio for voice cloning |
| `ref_text` | string | `null` | Transcript of reference audio |
| `instruct` | string | `null` | Style/emotion instruction |
### Streaming TTS
```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Kokoro-82M-bf16",
    "input": "Streaming audio over HTTP.",
    "voice": "af_heart",
    "stream": true,
    "response_format": "wav"
  }' \
  --output streamed.wav
```
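The same request from Python, sketched with the third-party `requests` library (not part of MLX Audio); it opens the response as a stream and writes chunks to disk as they arrive:

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "model": "mlx-community/Kokoro-82M-bf16",
        "input": "Streaming audio over HTTP.",
        "voice": "af_heart",
        "stream": True,
        "response_format": "wav",
    },
    stream=True,  # keep the connection open and read chunks as they arrive
)
resp.raise_for_status()
with open("streamed.wav", "wb") as f:
    for chunk in resp.iter_content(chunk_size=None):
        f.write(chunk)
```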
### Speech-to-Text

`POST /v1/audio/transcriptions`
```bash
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=mlx-community/whisper-large-v3-turbo-asr-fp16"
```
Form fields:
| Field | Type | Default | Description |
|---|---|---|---|
| `file` | file | required | Audio file to transcribe |
| `model` | string | required | STT model ID |
| `language` | string | `null` | Language code |
| `max_tokens` | int | `1024` | Maximum output tokens |
| `stream` | bool | `false` | Stream results as NDJSON |
| `context` | string | `null` | Hotwords or metadata to guide transcription |
| `verbose` | bool | `false` | Include extra details |
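With `stream=true`, results arrive as newline-delimited JSON. A sketch of consuming the stream with `requests`; the exact per-event schema isn't documented here, so this just prints each parsed line:

```python
import json
import requests

with open("audio.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/v1/audio/transcriptions",
        files={"file": f},
        data={
            "model": "mlx-community/whisper-large-v3-turbo-asr-fp16",
            "stream": "true",
        },
        stream=True,
    )
resp.raise_for_status()
for line in resp.iter_lines():
    if line:
        print(json.loads(line))  # one JSON event per line
```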
### Audio Source Separation

`POST /v1/audio/separations`
```bash
curl -X POST http://localhost:8000/v1/audio/separations \
  -F "file=@mixed.wav" \
  -F "model=mlx-community/sam-audio-large-fp16" \
  -F "description=speech"
```
Returns JSON with base64-encoded target and residual WAV buffers.
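A sketch of decoding that response in Python. The `target` and `residual` key names are assumptions for illustration; check the actual JSON keys your server returns:

```python
import base64
import requests

with open("mixed.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/v1/audio/separations",
        files={"file": f},
        data={"model": "mlx-community/sam-audio-large-fp16", "description": "speech"},
    )
resp.raise_for_status()
payload = resp.json()
# Assumed key names -- inspect payload.keys() to confirm.
with open("target.wav", "wb") as out:
    out.write(base64.b64decode(payload["target"]))
with open("residual.wav", "wb") as out:
    out.write(base64.b64decode(payload["residual"]))
```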
### Model Management
```bash
# List loaded models
curl http://localhost:8000/v1/models

# Load a model
curl -X POST "http://localhost:8000/v1/models?model_name=mlx-community/Kokoro-82M-bf16"

# Unload a model
curl -X DELETE "http://localhost:8000/v1/models?model_name=mlx-community/Kokoro-82M-bf16"
```
## Real-Time WebSocket Transcription
The server exposes two WebSocket endpoints for live transcription. Both accept 16-bit signed little-endian PCM audio; they differ in wire protocol and intended consumers.
### `/v1/audio/transcriptions/realtime` (VAD-based streaming)

Send raw PCM frames as binary WebSocket messages; the server performs VAD, chunks on silence, and emits transcription JSON messages back. This endpoint uses the preloaded Whisper-style STT model; connect after calling `POST /v1/models` to select it.
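A minimal client sketch, assuming the `websockets` and `soundfile` packages. The expected input sample rate and the exact shape of the returned JSON are not pinned down here, so treat this as illustrative; the receive loop runs until the server closes the socket:

```python
import asyncio, json, soundfile as sf, websockets

async def stream(path: str):
    # Raw 16-bit signed little-endian PCM, sent as binary WebSocket frames.
    audio, sr = sf.read(path, dtype="int16", always_2d=False)
    uri = "ws://localhost:8000/v1/audio/transcriptions/realtime"
    async with websockets.connect(uri) as ws:
        step = sr // 10  # ~100 ms of audio per frame
        for i in range(0, len(audio), step):
            await ws.send(audio[i : i + step].tobytes())
        # The server chunks on silence and pushes transcription JSON back.
        async for raw in ws:
            print(json.loads(raw))

asyncio.run(stream("audio.wav"))
```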
### `/v1/realtime` (OpenAI Realtime-compatible)

Implements a subset of the OpenAI Realtime API wire protocol, so existing Realtime clients can target MLX Audio by swapping the base URL. Any STT model that exposes `create_streaming_session` works, including `voxtral_realtime`, which is streaming-first and recommended.
Model selection order (first match wins):
1. `?model=<id>` query parameter on connect.
2. `session.update.model` (or `session.audio.input.transcription.model`) event after connect.
3. `--realtime-model` CLI flag / `MLX_AUDIO_REALTIME_MODEL` env var.
If none is set, the server replies with an error event and closes the socket.
Client → server events (subset of OpenAI Realtime):
| Type | Purpose |
|---|---|
| `session.update` | Change the model, input sample rate, or transcription config |
| `input_audio_buffer.append` | Append a chunk of base64-encoded PCM16 to the current item |
| `input_audio_buffer.commit` | Signal end-of-utterance; the server drains deltas and emits `completed` |
**Declaring the input sample rate.** The server assumes incoming PCM is 24 kHz by default (the OpenAI Realtime client convention). If you are sending audio at a different rate, e.g. a 16 kHz microphone capture or a 48 kHz file, tell the server via `session.update` so it resamples correctly. Without this, 16 kHz audio interpreted as 24 kHz would sound sped-up to the model and transcribe as garbage.
```json
{
  "type": "session.update",
  "session": {
    "audio": {"input": {"format": {"type": "audio/pcm", "rate": 16000}}}
  }
}
```
This only declares your input rate. The server resamples from that to whatever rate the model expects internally; you never need to match the model's native rate.
Server → client events:
| Type | Purpose |
|---|---|
| `session.created` / `session.updated` | Session snapshot (id, model, input format) |
| `conversation.item.added` | New user item opened for the next audio chunk |
| `input_audio_buffer.committed` | Acknowledges a commit from the client |
| `conversation.item.input_audio_transcription.delta` | Incremental transcript token(s) |
| `conversation.item.input_audio_transcription.completed` | Final transcript for the item |
| `error` | Error message; the socket may close afterwards |
Minimal Python client:
```python
import asyncio, base64, json, websockets, soundfile as sf

async def transcribe(path: str):
    audio, sr = sf.read(path, dtype="int16", always_2d=False)
    uri = "ws://localhost:8000/v1/realtime?model=iris-sfg/Voxtral-Mini-4B-Realtime-2602-4bit"
    async with websockets.connect(uri) as ws:
        # Wait for session.created
        await ws.recv()
        # Declare the input rate (the server will resample to the model's rate)
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"audio": {"input": {"format": {"type": "audio/pcm", "rate": sr}}}},
        }))
        # Stream audio in 100 ms chunks
        step = sr // 10
        for i in range(0, len(audio), step):
            chunk = audio[i : i + step].tobytes()
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(chunk).decode(),
            }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
        async for raw in ws:
            evt = json.loads(raw)
            if evt["type"].endswith(".delta"):
                print(evt["delta"], end="", flush=True)
            elif evt["type"].endswith(".completed"):
                print()
                break

asyncio.run(transcribe("audio.wav"))
```
Tuning transcription delay (for models like `voxtral_realtime` that expose the knob):
```bash
mlx_audio.server \
  --realtime-model iris-sfg/Voxtral-Mini-4B-Realtime-2602-4bit \
  --realtime-transcription-delay-ms 960
```
Lower values reduce latency at the cost of accuracy. Models that don't declare a `transcription_delay_ms` parameter silently ignore the flag.
**Single-process MLX serialization.** MLX inference on a given device is serialized inside the server. `/v1/realtime` schedules `session.step()` calls cooperatively: two concurrent WebSocket clients share GPU time rather than running in parallel. This is fine as long as each stream transcribes faster than real-time; otherwise run multiple server instances behind a load balancer.
## Web Interface (Studio UI)

The Studio UI is a Next.js application located in `mlx_audio/ui/`. It provides:
- A text input for TTS generation with voice and model selectors
- Audio upload and recording for STT
- 3D audio visualization
- Model loading and management
### Running the UI Separately
If you prefer to run the UI outside the server process:
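A sketch assuming the standard Next.js workflow for the app in `mlx_audio/ui/` (the project may use a different package manager or script names):

```bash
cd mlx_audio/ui
npm install   # install dependencies
npm run dev   # start the dev server on port 3000
```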
The UI runs on http://localhost:3000 and expects the API server at http://localhost:8000.
## Using with the OpenAI Python Client
Because the API is OpenAI-compatible, you can use the official OpenAI Python client:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # MLX Audio does not require an API key
)

# Text-to-Speech
response = client.audio.speech.create(
    model="mlx-community/Kokoro-82M-bf16",
    voice="af_heart",
    input="Hello from the OpenAI client!",
)
response.stream_to_file("output.mp3")

# Speech-to-Text
with open("audio.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="mlx-community/whisper-large-v3-turbo-asr-fp16",
        file=f,
    )
print(transcript.text)
```
## Installation

The server requires the `server` optional dependency group:
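For example, with pip (assuming the distribution name `mlx-audio`):

```bash
pip install "mlx-audio[server]"
```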
This installs FastAPI, Uvicorn, and python-multipart.