
STT API Reference

Model Loading

mlx_audio.stt.utils

Example:

from mlx_audio.stt import load

model = load("mlx-community/whisper-tiny-asr-fp16")
result = model.generate(audio)

mlx_audio.stt.utils.load(model_path, lazy=False, strict=False, **kwargs)

Load a speech-to-text model from a local path or HuggingFace repository.

This is the main entry point for loading STT models. It automatically detects the model type and initializes the appropriate model class.
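To illustrate what "automatically detects the model type" could mean in practice, here is a hypothetical sketch (not the library's actual implementation) that reads a `config.json` from the model directory and checks its `model_type` field; the field name and the set of supported types are assumptions for illustration:

```python
# Hypothetical sketch of config-based model-type detection. The config file
# name, field name, and supported types are illustrative assumptions.
import json
from pathlib import Path


def detect_model_type(model_dir: str) -> str:
    """Return the model type declared in the repo's config.json."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    model_type = config.get("model_type", "")
    known = {"whisper", "parakeet", "wav2vec2"}  # illustrative set
    if model_type not in known:
        raise ValueError(f"Unsupported model type: {model_type!r}")
    return model_type
```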

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model_path` | `Union[str, Path]` | The local path or HuggingFace repo ID to load from. | required |
| `lazy` | `bool` | If `False`, evaluate model parameters immediately. | `False` |
| `strict` | `bool` | If `True`, raise an error if any weights are missing. | `False` |
| `**kwargs` | `Any` | Additional keyword arguments such as `revision` and `force_download`. | `{}` |

Returns:

| Type | Description |
| --- | --- |
| `nn.Module` | The loaded and initialized model. |

mlx_audio.stt.utils.load_model(model_path, lazy=False, strict=False, **kwargs)

Load and initialize an STT model from a given path.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model_path` | `Union[str, Path]` | The path or HuggingFace repo to load the model from. | required |
| `lazy` | `bool` | If `False`, evaluate model parameters immediately. | `False` |
| `strict` | `bool` | If `True`, raise an error if any weights are missing. | `False` |
| `**kwargs` | `Any` | Additional keyword arguments (`revision`, `force_download`). | `{}` |

Returns:

| Type | Description |
| --- | --- |
| `nn.Module` | The loaded and initialized model. |

mlx_audio.stt.utils.load_audio(file=None, sr=SAMPLE_RATE, from_stdin=False, dtype=mx.float32)

Open an audio file and read it as a mono waveform, resampling as necessary.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `file` | `str` | The audio file to open. | `None` |
| `sr` | `int` | The sample rate to resample the audio to, if necessary. | `SAMPLE_RATE` |

Returns:

A NumPy array containing the audio waveform, in float32 dtype.
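Conceptually, the mono/float32 normalization that such a loader performs can be sketched with NumPy. This is an illustration of the output contract only, not the library's internals (which also handle decoding and resampling):

```python
import numpy as np


def to_mono_float32(audio: np.ndarray) -> np.ndarray:
    """Downmix multi-channel audio to mono and convert to float32.

    Illustration only: shows the mono/dtype normalization step implied
    by load_audio's documented return value.
    """
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 2:
        # Average across the channel axis (assume the smaller axis is channels).
        channel_axis = int(np.argmin(audio.shape))
        audio = audio.mean(axis=channel_axis)
    return audio
```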

mlx_audio.stt.utils.resample_audio(audio, orig_sr, target_sr)

Resample an audio waveform from orig_sr to target_sr.
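A simple stand-in for what resampling does, using linear interpolation; the library may use a higher-quality method internally, so treat this as a conceptual sketch:

```python
import numpy as np


def resample_linear(audio: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
    """Resample a 1-D waveform via linear interpolation (conceptual sketch)."""
    if orig_sr == target_sr:
        return audio
    n_out = int(round(len(audio) * target_sr / orig_sr))
    # Sample positions in the original signal for each output sample.
    positions = np.linspace(0, len(audio) - 1, n_out)
    return np.interp(positions, np.arange(len(audio)), audio).astype(audio.dtype)
```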

Transcription CLI

The mlx_audio.stt.generate module provides a command-line interface for transcription:

python -m mlx_audio.stt.generate \
    --model mlx-community/whisper-large-v3-turbo-asr-fp16 \
    --audio speech.wav \
    --output-path output \
    --format json \
    --verbose
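The `--format` flag selects the output container. For `srt`, timestamps follow the standard `HH:MM:SS,mmm` convention; a small formatter illustrating that convention (not the CLI's actual code):

```python
def format_srt_timestamp(seconds: float) -> str:
    """Format a time offset as an SRT timestamp: HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"
```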

CLI Arguments

| Argument | Default | Description |
| --- | --- | --- |
| `--model` | `whisper-large-v3-turbo` | Model path or HuggingFace repo |
| `--audio` | required | Path to the audio file |
| `--output-path` | required | Directory to save output |
| `--format` | `txt` | Output format: `txt`, `srt`, `vtt`, `json` |
| `--language` | `en` | Language code |
| `--max-tokens` | `8192` | Maximum output tokens |
| `--chunk-duration` | `30.0` | Chunk duration in seconds |
| `--stream` | `false` | Stream transcription output |
| `--context` | `null` | Hotwords or metadata string |
| `--prefill-step-size` | `2048` | Prefill step size |
| `--verbose` | `false` | Print detailed output |
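`--chunk-duration` splits long audio into fixed-length windows before transcription. A minimal sketch of that chunking step (assumed behavior; the CLI may overlap or pad chunks differently):

```python
import numpy as np


def chunk_audio(audio: np.ndarray, sr: int, chunk_duration: float = 30.0):
    """Split a 1-D waveform into consecutive chunks of chunk_duration seconds."""
    samples_per_chunk = int(sr * chunk_duration)
    return [
        audio[start : start + samples_per_chunk]
        for start in range(0, len(audio), samples_per_chunk)
    ]
```

For example, 70 seconds of 16 kHz audio yields two full 30-second chunks and one 10-second remainder.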