
STT API Reference

Model Loading

mlx_audio.stt.utils

Example:

from mlx_audio.stt import load

model = load("mlx-community/whisper-tiny-asr-fp16")
result = model.generate(audio)

mlx_audio.stt.utils.load(model_path, lazy=False, strict=False, **kwargs)

Load a speech-to-text model from a local path or HuggingFace repository.

This is the main entry point for loading STT models. It automatically detects the model type and initializes the appropriate model class.
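To illustrate what "automatically detects the model type" could mean in practice, here is a hypothetical sketch (not the library's actual implementation) that reads a `config.json` from the model directory and checks its `model_type` field; the field name and the set of supported types are assumptions for illustration:

```python
# Hypothetical sketch of config-based model-type detection. The config file
# name, field name, and supported types are illustrative assumptions.
import json
from pathlib import Path


def detect_model_type(model_dir: str) -> str:
    """Return the model type declared in the repo's config.json."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    model_type = config.get("model_type", "")
    known = {"whisper", "parakeet", "wav2vec2"}  # illustrative set
    if model_type not in known:
        raise ValueError(f"Unsupported model type: {model_type!r}")
    return model_type
```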

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model_path` | `Union[str, Path]` | The local path or HuggingFace repo ID to load from. | required |
| `lazy` | `bool` | If `False`, evaluate model parameters immediately. | `False` |
| `strict` | `bool` | If `True`, raise an error if any weights are missing. | `False` |
| `**kwargs` | `Any` | Additional keyword arguments such as `revision` and `force_download`. | `{}` |

Returns:

| Type | Description |
| --- | --- |
| `nn.Module` | The loaded and initialized model. |

mlx_audio.stt.utils.load_model(model_path, lazy=False, strict=False, **kwargs)

Load and initialize an STT model from a given path.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model_path` | `Union[str, Path]` | The path or HuggingFace repo to load the model from. | required |
| `lazy` | `bool` | If `False`, evaluate model parameters immediately. | `False` |
| `strict` | `bool` | If `True`, raise an error if any weights are missing. | `False` |
| `**kwargs` | `Any` | Additional keyword arguments (`revision`, `force_download`). | `{}` |

Returns:

| Type | Description |
| --- | --- |
| `nn.Module` | The loaded and initialized model. |

mlx_audio.stt.utils.load_audio(file=None, sr=SAMPLE_RATE, from_stdin=False, dtype=mx.float32)

Open an audio file and read it as a mono waveform, resampling as necessary.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `file` | `str` | The audio file to open. | `None` |
| `sr` | `int` | The sample rate to resample the audio to, if necessary. | `SAMPLE_RATE` |

Returns:

A NumPy array containing the audio waveform, in float32 dtype.
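Conceptually, the mono/float32 normalization that such a loader performs can be sketched with NumPy. This is an illustration of the output contract only, not the library's internals (which also handle decoding and resampling):

```python
import numpy as np


def to_mono_float32(audio: np.ndarray) -> np.ndarray:
    """Downmix multi-channel audio to mono and convert to float32.

    Illustration only: shows the mono/dtype normalization step implied
    by load_audio's documented return value.
    """
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 2:
        # Average across the channel axis (assume the smaller axis is channels).
        channel_axis = int(np.argmin(audio.shape))
        audio = audio.mean(axis=channel_axis)
    return audio
```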

mlx_audio.stt.utils.resample_audio(audio, orig_sr, target_sr)

Resample an audio waveform from orig_sr to target_sr.
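A simple stand-in for what resampling does, using linear interpolation; the library may use a higher-quality method internally, so treat this as a conceptual sketch:

```python
import numpy as np


def resample_linear(audio: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
    """Resample a 1-D waveform via linear interpolation (conceptual sketch)."""
    if orig_sr == target_sr:
        return audio
    n_out = int(round(len(audio) * target_sr / orig_sr))
    # Sample positions in the original signal for each output sample.
    positions = np.linspace(0, len(audio) - 1, n_out)
    return np.interp(positions, np.arange(len(audio)), audio).astype(audio.dtype)
```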

Transcription CLI

The mlx_audio.stt.generate module provides a command-line interface for transcription:

python -m mlx_audio.stt.generate \
    --model mlx-community/whisper-large-v3-turbo-asr-fp16 \
    --audio speech.wav \
    --output-path output \
    --format json \
    --verbose
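The `--format` flag selects the output container. For `srt`, timestamps follow the standard `HH:MM:SS,mmm` convention; a small formatter illustrating that convention (not the CLI's actual code):

```python
def format_srt_timestamp(seconds: float) -> str:
    """Format a time offset as an SRT timestamp: HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"
```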

CLI Arguments

| Argument | Default | Description |
| --- | --- | --- |
| `--model` | `whisper-large-v3-turbo` | Model path or HuggingFace repo |
| `--audio` | required | Path to the audio file |
| `--output-path` | required | Directory to save output |
| `--format` | `txt` | Output format: `txt`, `srt`, `vtt`, `json` |
| `--language` | `en` | Language code |
| `--max-tokens` | `8192` | Maximum output tokens |
| `--chunk-duration` | `30.0` | Chunk duration in seconds |
| `--stream` | `false` | Stream transcription output |
| `--context` | `null` | Hotwords or metadata string |
| `--prefill-step-size` | `2048` | Prefill step size |
| `--verbose` | `false` | Print detailed output |
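`--chunk-duration` splits long audio into fixed-length windows before transcription. A minimal sketch of that chunking step (assumed behavior; the CLI may overlap or pad chunks differently):

```python
import numpy as np


def chunk_audio(audio: np.ndarray, sr: int, chunk_duration: float = 30.0):
    """Split a 1-D waveform into consecutive chunks of chunk_duration seconds."""
    samples_per_chunk = int(sr * chunk_duration)
    return [
        audio[start : start + samples_per_chunk]
        for start in range(0, len(audio), samples_per_chunk)
    ]
```

For example, 70 seconds of 16 kHz audio yields two full 30-second chunks and one 10-second remainder.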