STT API Reference¶
Model Loading¶
mlx_audio.stt.utils¶
Example:

```python
from mlx_audio.stt import load

model = load("mlx-community/whisper-tiny-asr-fp16")
result = model.generate(audio)
```
mlx_audio.stt.utils.load(model_path, lazy=False, strict=False, **kwargs)¶
Load a speech-to-text model from a local path or HuggingFace repository.
This is the main entry point for loading STT models. It automatically detects the model type and initializes the appropriate model class.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model_path | Union[str, Path] | The local path or HuggingFace repo ID to load from. | required |
| lazy | bool | If False, evaluate model parameters immediately. | False |
| strict | bool | If True, raise an error if any weights are missing. | False |
| **kwargs | Any | Additional keyword arguments such as revision or force_download. | {} |
Returns:

| Type | Description |
|---|---|
| Module | nn.Module: The loaded and initialized model. |
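The automatic model-type detection mentioned above begins with deciding whether the argument is a local checkpoint or a HuggingFace repo ID. One plausible way to make that distinction is sketched below; `is_local_model` is a hypothetical illustration, not the library's actual logic:

```python
from pathlib import Path

def is_local_model(model_path):
    # Hypothetical sketch: treat the argument as a local checkpoint
    # directory if it exists on disk; otherwise assume it is a
    # HuggingFace repo ID that must be fetched before loading.
    return Path(model_path).exists()
```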
mlx_audio.stt.utils.load_model(model_path, lazy=False, strict=False, **kwargs)¶
Load and initialize an STT model from a given path.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model_path | Union[str, Path] | The path or HuggingFace repo to load the model from. | required |
| lazy | bool | If False, evaluate model parameters immediately. | False |
| strict | bool | If True, raise an error if any weights are missing. | False |
| **kwargs | Any | Additional keyword arguments (revision, force_download). | {} |
Returns:

| Type | Description |
|---|---|
| Module | nn.Module: The loaded and initialized model. |
mlx_audio.stt.utils.load_audio(file=Optional[str], sr=SAMPLE_RATE, from_stdin=False, dtype=mx.float32)¶
mlx_audio.stt.utils.resample_audio(audio, orig_sr, target_sr)¶
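resample_audio converts audio from one sample rate to another. As an illustration of the idea, here is a minimal pure-Python sketch of linear-interpolation resampling; the library's own implementation may use a different (higher-quality) method:

```python
def resample_linear(samples, orig_sr, target_sr):
    """Linearly interpolate samples from orig_sr to target_sr.

    Illustrative sketch only; not mlx_audio's implementation.
    """
    if orig_sr == target_sr:
        return list(samples)
    n_out = int(len(samples) * target_sr / orig_sr)
    out = []
    for i in range(n_out):
        pos = i * orig_sr / target_sr        # fractional index into input
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)   # clamp at the last sample
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```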
Transcription CLI¶
The mlx_audio.stt.generate module provides a command-line interface for transcription:
```bash
python -m mlx_audio.stt.generate \
  --model mlx-community/whisper-large-v3-turbo-asr-fp16 \
  --audio speech.wav \
  --output-path output \
  --format json \
  --verbose
```
CLI Arguments¶
| Argument | Default | Description |
|---|---|---|
| --model | whisper-large-v3-turbo | Model path or HuggingFace repo |
| --audio | required | Path to the audio file |
| --output-path | required | Directory to save output |
| --format | txt | Output format: txt, srt, vtt, json |
| --language | en | Language code |
| --max-tokens | 8192 | Maximum output tokens |
| --chunk-duration | 30.0 | Chunk duration in seconds |
| --stream | false | Stream transcription output |
| --context | null | Hotwords or metadata string |
| --prefill-step-size | 2048 | Prefill step size |
| --verbose | false | Print detailed output |
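When driving the CLI from Python (for example, batch-transcribing a directory), it can help to assemble the argument list with a small helper. `build_transcribe_cmd` below is a hypothetical convenience wrapper around the flags in the table above, not part of mlx_audio:

```python
import subprocess

def build_transcribe_cmd(audio_path, output_dir,
                         model="mlx-community/whisper-large-v3-turbo-asr-fp16",
                         fmt="txt", language="en",
                         stream=False, verbose=False):
    # Assemble the argv list for `python -m mlx_audio.stt.generate`,
    # mirroring the CLI arguments documented above.
    cmd = ["python", "-m", "mlx_audio.stt.generate",
           "--model", model,
           "--audio", audio_path,
           "--output-path", output_dir,
           "--format", fmt,
           "--language", language]
    if stream:
        cmd.append("--stream")
    if verbose:
        cmd.append("--verbose")
    return cmd

# e.g. subprocess.run(build_transcribe_cmd("speech.wav", "output", fmt="json"), check=True)
```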