TTS API Reference
Model Loading
The primary entry points for loading TTS models.
mlx_audio.tts.utils
Example:
```python
from mlx_audio.tts import load

# Load a model by HuggingFace repo ID, then synthesize speech from text.
model = load("mlx-community/outetts-0.3-500M-bf16")
audio = model.generate("Hello world!")
```
mlx_audio.tts.utils.load(model_path, lazy=False, strict=True, **kwargs)
Load a text-to-speech model from a local path or HuggingFace repository.
This is the main entry point for loading TTS models. It automatically detects the model type and initializes the appropriate model class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_path` | `Union[str, Path]` | The local path or HuggingFace repo ID to load from. | required |
| `lazy` | `bool` | If False, evaluate model parameters immediately. | `False` |
| `strict` | `bool` | If True, raise an error if any weights are missing. | `True` |
| `**kwargs` | `Any` | Additional keyword arguments such as | `{}` |
Returns:
| Type | Description |
|---|---|
| `Module` | `nn.Module`: The loaded and initialized model. |
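The description above says `model_path` may be either a local path or a HuggingFace repo ID. The sketch below illustrates one plausible way to tell the two apart before loading. `classify_model_source` is a hypothetical helper written for this page, and the rule it applies (an existing directory is local, anything else is treated as a repo ID) is an assumption, not documented library behavior.

```python
from pathlib import Path
from typing import Union


def classify_model_source(model_path: Union[str, Path]) -> str:
    """Return "local" if model_path is an existing directory, else "hub".

    Hypothetical helper: the directory-existence check is an assumed rule.
    """
    return "local" if Path(model_path).is_dir() else "hub"


# The current directory exists, so it is classified as a local source.
here = classify_model_source(".")
# A repo-style ID that is not a directory on disk is classified as a hub source.
remote = classify_model_source("mlx-community/outetts-0.3-500M-bf16")
```

Either kind of value would be passed unchanged as the first argument to `load`.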
mlx_audio.tts.utils.load_model(model_path, lazy=False, strict=True, **kwargs)
Load and initialize the model from a given path.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_path` | `Path` | The path to load the model from. | required |
| `lazy` | `bool` | If False, evaluate the model parameters so they are loaded in memory before returning; otherwise they are loaded when needed. | `False` |
|
Returns:
| Type | Description |
|---|---|
| `Module` | `nn.Module`: The loaded and initialized model. |
Raises:
| Type | Description |
|---|---|
| `FileNotFoundError` | If the weight files (.safetensors) are not found. |
| `ValueError` | If the model class or args class are not found or cannot be instantiated. |
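The two documented exceptions can be handled separately by the caller. The following is a minimal sketch of that pattern, not code from the library: `try_load` and `_fake_loader` are hypothetical names, and the loader is passed in as a parameter so the example runs without `mlx_audio` installed. In real use, `load_model` would take the place of the stand-in loader.

```python
from typing import Any, Callable, Optional, Tuple


def try_load(
    loader: Callable[[str], Any], model_path: str
) -> Tuple[Optional[Any], Optional[str]]:
    """Call loader(model_path) and convert the documented errors into messages."""
    try:
        return loader(model_path), None
    except FileNotFoundError as exc:
        # Documented case: the .safetensors weight files are missing.
        return None, f"weights not found: {exc}"
    except ValueError as exc:
        # Documented case: model or args class not found or not instantiable.
        return None, f"unsupported or invalid model: {exc}"


def _fake_loader(path: str) -> Any:
    # Stand-in for load_model (hypothetical) that always fails like a missing checkpoint.
    raise FileNotFoundError(f"no safetensors in {path}")


model, error = try_load(_fake_loader, "./missing-model")
```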
mlx_audio.tts.utils.get_available_models()
Get a list of all available TTS model types by scanning the models directory.
Returns:
| Type | Description |
|---|---|
| `List[str]` | A list of available model type names. |
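As an illustration of what "scanning the models directory" could look like, here is a small sketch that lists sub-directory names. `list_model_types` is a hypothetical function, the directory contents are made up for the demonstration, and the choice to skip names starting with an underscore is an assumption rather than the library's documented rule.

```python
import tempfile
from pathlib import Path
from typing import List


def list_model_types(models_dir: Path) -> List[str]:
    """Return sorted sub-directory names, skipping private ones such as __pycache__."""
    return sorted(
        entry.name
        for entry in models_dir.iterdir()
        if entry.is_dir() and not entry.name.startswith("_")
    )


# Build a throwaway directory tree that mimics a package of model sub-packages.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("outetts", "example_model", "__pycache__"):
        (root / name).mkdir()
    (root / "base.py").touch()  # plain files are ignored
    found = list_model_types(root)
```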
mlx_audio.tts.utils.get_model_and_args(model_type, model_name)
Retrieve the model architecture module based on the model type and name.
This function attempts to find the appropriate model architecture by:
1. Checking whether model_type is directly in the MODEL_REMAPPING dictionary.
2. Looking for partial matches in segments of model_name.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_type` | `str` | The type of model to load (e.g., "outetts"). | required |
| `model_name` | `List[str]` | List of model name components that might contain remapping information. | required |
Returns:
| Type | Description |
|---|---|
| `Tuple[Any, str]` | `Tuple[module, str]`: A tuple containing the imported architecture module and the resolved model_type string after remapping. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If the model type is not supported (module import fails). |
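The two-step lookup described for this function can be sketched in plain Python. This is a simplified illustration under stated assumptions: the `MODEL_REMAPPING` entry below is invented for the example (the real table's contents are not shown on this page), `resolve_model_type` is a hypothetical name, and the sketch stops at resolving the type string rather than importing a module.

```python
from typing import Dict, List

# Hypothetical remapping table, invented for this example.
MODEL_REMAPPING: Dict[str, str] = {"outetts_v3": "outetts"}


def resolve_model_type(model_type: str, model_name: List[str]) -> str:
    """Resolve a model type via direct remapping, then via name segments."""
    # Step 1: direct lookup of model_type in the remapping table.
    if model_type in MODEL_REMAPPING:
        return MODEL_REMAPPING[model_type]
    # Step 2: check each segment of the model name against the table.
    for segment in model_name:
        if segment in MODEL_REMAPPING:
            return MODEL_REMAPPING[segment]
    # No remapping applies, so the type is returned unchanged.
    return model_type


direct = resolve_model_type("outetts_v3", [])
by_segment = resolve_model_type("unknown", ["mlx-community", "outetts_v3", "bf16"])
unchanged = resolve_model_type("unknown", ["a", "b"])
```

A full implementation would then try to import the architecture module for the resolved type and raise `ValueError` if that import fails, as the table above documents.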
mlx_audio.tts.utils.fetch_from_hub(model_path, lazy=False, **kwargs)
mlx_audio.tts.utils.upload_to_hub(path, upload_repo, hf_path)
Upload the model to the Hugging Face Hub.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Local path to the model. | required |
| `upload_repo` | `str` | Name of the HF repo to upload to. | required |
| `hf_path` | `str` | Path to the original Hugging Face model. | required |
Audio Generation
mlx_audio.tts.generate
mlx_audio.tts.generate.generate_audio(text, model=None, max_tokens=1200, voice='af_heart', prompt=None, instruct=None, speed=1.0, lang_code='en', cfg_scale=None, ddpm_steps=None, sigma=None, ref_audio=None, ref_text=None, stt_model='mlx-community/whisper-large-v3-turbo-asr-fp16', output_path=None, file_prefix='audio', audio_format='wav', join_audio=False, play=False, verbose=True, temperature=0.7, stream=False, streaming_interval=2.0, save=False, use_zero_spk_emb=False, **kwargs)
Generates audio from text using a specified TTS model.
Parameters:
- text (str): The input text to be converted to speech.
- model (str or object): The TTS model to use, given by name or as an already loaded model.
- voice (str): The voice style to use (also used as the speaker for Qwen3-TTS models).
- instruct (str): Instruction for emotion/style (CustomVoice) or voice description (VoiceDesign).
- temperature (float): The temperature for the model.
- speed (float): Playback speed multiplier.
- lang_code (str): The language code.
- ref_audio (mx.array): Reference audio to clone the voice from.
- ref_text (str): Caption for the reference audio.
- stt_model (str or object): An MLX Whisper model to use for transcription, given by name or as an already loaded STT model.
- output_path (str): Directory path where audio files will be saved.
- file_prefix (str): The output file path without extension.
- audio_format (str): Output audio format (e.g., "wav", "flac").
- join_audio (bool): Whether to join multiple audio files into one.
- play (bool): Whether to play the generated audio.
- verbose (bool): Whether to print status messages.
- save (bool): Whether to save streamed audio to a file when using stream mode.

Returns:
- None: The function writes the generated audio to a file when not streaming, or when streaming with saving enabled.
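A call to `generate_audio` might be assembled as in the sketch below. The keyword names come from the signature documented above and the model ID from the example at the top of this page; the values, the `generation_kwargs` dictionary, and the `synthesize` wrapper are illustrative assumptions. Whether a particular model accepts every option is not covered here, so treat this as a starting point rather than a verified recipe.

```python
# Keyword arguments named after the documented signature; values are illustrative.
generation_kwargs = {
    "text": "Hello world!",
    "model": "mlx-community/outetts-0.3-500M-bf16",
    "speed": 1.0,
    "lang_code": "en",
    "file_prefix": "hello",
    "audio_format": "wav",
    "join_audio": True,
    "verbose": False,
}


def synthesize(**overrides):
    """Call generate_audio with the defaults above, updated by any overrides."""
    # Imported inside the function so the sketch can be read without mlx_audio installed.
    from mlx_audio.tts.generate import generate_audio

    return generate_audio(**{**generation_kwargs, **overrides})
```

Calling `synthesize(text="Another sentence.")` would reuse the same settings with different input text. Per the documentation above, the function returns `None` and writes its result to a file.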