TTS API Reference

Model Loading

The primary entry points for loading TTS models.

mlx_audio.tts.utils

Example:

from mlx_audio.tts import load

model = load("mlx-community/outetts-0.3-500M-bf16")
audio = model.generate("Hello world!")

mlx_audio.tts.utils.load(model_path, lazy=False, strict=True, **kwargs)

Load a text-to-speech model from a local path or HuggingFace repository.

This is the main entry point for loading TTS models. It automatically detects the model type and initializes the appropriate model class.

Parameters:

- model_path (Union[str, Path], required): The local path or HuggingFace repo ID to load from.
- lazy (bool, default False): If False, evaluate model parameters immediately.
- strict (bool, default True): If True, raise an error if any weights are missing.
- **kwargs (Any, default {}): Additional keyword arguments such as revision and force_download.

Returns:

- nn.Module: The loaded and initialized model.
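
As a minimal sketch of how the documented parameters fit together, the wrapper below forwards an optional revision through **kwargs. The wrapper name load_tts_model is hypothetical; only load and its lazy, strict, and revision arguments come from the reference above.

```python
from typing import Any, Optional


def load_tts_model(
    repo_or_path: str,
    revision: Optional[str] = None,
    lazy: bool = False,
) -> Any:
    """Hypothetical convenience wrapper around mlx_audio.tts.utils.load."""
    # Imported inside the function so that this module can still be imported
    # on machines where mlx_audio is not installed.
    from mlx_audio.tts.utils import load

    extra = {}
    if revision is not None:
        # The reference lists revision as one of the accepted keyword arguments.
        extra["revision"] = revision

    # strict=True (the documented default) raises if any weights are missing.
    return load(repo_or_path, lazy=lazy, strict=True, **extra)
```

Calling load_tts_model("mlx-community/outetts-0.3-500M-bf16") would then behave like the example at the top of this section, with parameters evaluated immediately because lazy is False.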

mlx_audio.tts.utils.load_model(model_path, lazy=False, strict=True, **kwargs)

Load and initialize the model from a given path.

Parameters:

- model_path (Path, required): The path to load the model from.
- lazy (bool, default False): If False, evaluate the model parameters before returning so that they are loaded in memory; if True, they are loaded when first needed.

Returns:

- nn.Module: The loaded and initialized model.

Raises:

- FileNotFoundError: If the weight files (.safetensors) are not found.
- ValueError: If the model class or args class is not found or cannot be instantiated.
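
The FileNotFoundError case can be illustrated with a small standalone check. This is a sketch under assumptions, not the library's implementation: find_weight_files is a hypothetical helper, and it assumes the weights sit directly in the model directory.

```python
from pathlib import Path
from typing import List


def find_weight_files(model_dir: Path) -> List[Path]:
    """Return the .safetensors files in model_dir, or raise if there are none."""
    # Assumption: weight files live at the top level of the model directory.
    weight_files = sorted(Path(model_dir).glob("*.safetensors"))
    if not weight_files:
        raise FileNotFoundError(f"No .safetensors files found in {model_dir}")
    return weight_files
```

Running a check like this before calling load_model lets a caller report a missing download separately from a ValueError caused by an unsupported model class.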

mlx_audio.tts.utils.get_available_models()

Get a list of all available TTS model types by scanning the models directory.

Returns:

- List[str]: A list of available model type names.
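
A directory scan of this kind can be sketched with pathlib. The function below is illustrative only: list_model_types is a hypothetical name, and it assumes each model type is a subdirectory whose name does not start with an underscore or a dot.

```python
from pathlib import Path
from typing import List


def list_model_types(models_dir: Path) -> List[str]:
    """List subdirectory names of models_dir, treating each as a model type."""
    return sorted(
        entry.name
        for entry in Path(models_dir).iterdir()
        # Skip plain files and private or cache directories such as __pycache__.
        if entry.is_dir() and not entry.name.startswith(("_", "."))
    )
```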

mlx_audio.tts.utils.get_model_and_args(model_type, model_name)

Retrieve the model architecture module based on the model type and name.

This function attempts to find the appropriate model architecture by:

1. Checking if the model_type is directly in the MODEL_REMAPPING dictionary
2. Looking for partial matches in segments of the model_name

Parameters:

- model_type (str, required): The type of model to load (e.g., "outetts").
- model_name (List[str], required): List of model name components that might contain remapping information.

Returns:

- Tuple[Any, str]: A tuple (module, model_type) containing the imported architecture module and the resolved model_type string after remapping.

Raises:

- ValueError: If the model type is not supported (module import fails).
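
The two-step lookup and the import failure described for get_model_and_args can be sketched as follows. This is an approximation under stated assumptions: the helper names, the substring matching rule, and the package path default are all hypothetical, and the remapping dictionary is passed in rather than taken from the library.

```python
import importlib
from types import ModuleType
from typing import Dict, List


def resolve_model_type(
    model_type: str, model_name: List[str], remapping: Dict[str, str]
) -> str:
    """Resolve a model type using a remapping table (hypothetical helper)."""
    # Step 1: direct lookup of the model_type in the remapping table.
    if model_type in remapping:
        return remapping[model_type]
    # Step 2: look for a remapping key inside any segment of the model name.
    # Assumption: a "partial match" means a case-insensitive substring match.
    for segment in model_name:
        for key, target in remapping.items():
            if key in segment.lower():
                return target
    # No remapping applies, so the type is used as given.
    return model_type


def import_architecture(
    model_type: str, package: str = "mlx_audio.tts.models"
) -> ModuleType:
    """Import package.model_type, converting import failures to ValueError."""
    try:
        return importlib.import_module(f"{package}.{model_type}")
    except ImportError as exc:
        raise ValueError(f"Model type {model_type} not supported.") from exc
```

Keeping the resolution step separate from the import makes the remapping logic testable without any model packages installed.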

mlx_audio.tts.utils.fetch_from_hub(model_path, lazy=False, **kwargs)

mlx_audio.tts.utils.upload_to_hub(path, upload_repo, hf_path)

Upload the model to the Hugging Face Hub.

Parameters:

- path (str, required): Local path to the model.
- upload_repo (str, required): Name of the HF repo to upload to.
- hf_path (str, required): Path to the original Hugging Face model.

Audio Generation

mlx_audio.tts.generate

mlx_audio.tts.generate.generate_audio(text, model=None, max_tokens=1200, voice='af_heart', prompt=None, instruct=None, speed=1.0, lang_code='en', cfg_scale=None, ddpm_steps=None, sigma=None, ref_audio=None, ref_text=None, stt_model='mlx-community/whisper-large-v3-turbo-asr-fp16', output_path=None, file_prefix='audio', audio_format='wav', join_audio=False, play=False, verbose=True, temperature=0.7, stream=False, streaming_interval=2.0, save=False, use_zero_spk_emb=False, **kwargs)

Generates audio from text using a specified TTS model.

Parameters:

- text (str): The input text to be converted to speech.
- model (str or object): The TTS model to use, given either as a string or as an already loaded model.
- voice (str): The voice style to use (also used as speaker for Qwen3-TTS models).
- instruct (str): Instruction for emotion/style (CustomVoice) or voice description (VoiceDesign).
- temperature (float): The temperature for the model.
- speed (float): Playback speed multiplier.
- lang_code (str): The language code.
- ref_audio (mx.array): Reference audio to clone the voice from.
- ref_text (str): Caption for the reference audio.
- stt_model (str or object): An MLX Whisper model to use for transcription, given either as a string or as an already loaded STT model.
- output_path (str): Directory path where audio files will be saved.
- file_prefix (str): The output file path without extension.
- audio_format (str): Output audio format (e.g., "wav", "flac").
- join_audio (bool): Whether to join multiple audio files into one.
- play (bool): Whether to play the generated audio.
- verbose (bool): Whether to print status messages.
- save (bool): Whether to save streamed audio to a file when using stream mode.

Returns:

- None: The function writes the generated audio to a file when not streaming, or when streaming with saving enabled.
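
A call that writes a single joined file might look like the sketch below. The two helper functions are hypothetical; every keyword they pass appears in the generate_audio signature above, and the chosen values are only examples.

```python
from typing import Any, Dict


def build_generate_kwargs(text: str, model: Any, out_dir: str) -> Dict[str, Any]:
    """Collect keyword arguments for generate_audio (hypothetical helper)."""
    return {
        "text": text,
        "model": model,           # a string or an already loaded model
        "voice": "af_heart",      # the default voice from the signature
        "speed": 1.0,
        "lang_code": "en",
        "output_path": out_dir,   # directory where audio files are saved
        "file_prefix": "audio",
        "audio_format": "wav",
        "join_audio": True,       # merge generated segments into one file
        "play": False,
        "verbose": False,
    }


def synthesize_to_file(text: str, model: Any, out_dir: str) -> None:
    """Generate speech and write it under out_dir; returns None."""
    # Imported lazily so this sketch can be imported without mlx_audio installed.
    from mlx_audio.tts.generate import generate_audio

    generate_audio(**build_generate_kwargs(text, model, out_dir))
```

Because generate_audio returns None, callers should look for the output under output_path rather than expect audio data back from the call.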

Data Classes

mlx_audio.tts.models.base

mlx_audio.tts.models.base.GenerationResult dataclass

Attributes:

- audio
- samples
- sample_rate
- segment_idx
- token_count
- audio_duration
- real_time_factor
- processing_time_seconds
- peak_memory_usage
- is_streaming_chunk (default False)
- is_final_chunk (default False)
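
To show how these fields might be consumed, the sketch below sums the duration of a sequence of results. ResultStub is a hypothetical stand-in that carries only three of the attribute names listed above, and the code assumes that samples is a count of audio samples and sample_rate is in samples per second.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ResultStub:
    """Hypothetical stand-in with a subset of GenerationResult's attributes."""

    samples: int
    sample_rate: int
    is_final_chunk: bool = False


def total_seconds(results: Iterable[ResultStub]) -> float:
    """Sum per-result durations, computed as samples divided by sample_rate."""
    return sum(r.samples / r.sample_rate for r in results)
```

The same function would work on any objects that expose samples and sample_rate attributes with those meanings.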

mlx_audio.tts.models.base.BatchGenerationResult dataclass

mlx_audio.tts.models.base.BaseModelArgs dataclass