Qwen2-Audio

Qwen2-Audio is a multimodal audio-language model that handles more than transcription. In MLX Audio it can be used for ASR, translation, captioning, emotion recognition, and general audio understanding through text prompts.

Models

| Model | Quantization | HuggingFace |
| --- | --- | --- |
| Qwen/Qwen2-Audio-7B-Instruct | bf16 | Model Card |
| mlx-community/Qwen2-Audio-7B-Instruct-4bit | 4-bit | Model Card |

Usage

From the command line:

```shell
python -m mlx_audio.stt.generate \
    --model mlx-community/Qwen2-Audio-7B-Instruct-4bit \
    --audio audio.wav \
    --prompt "Transcribe the audio."
```

Or from Python:

```python
from mlx_audio.stt.utils import load_model

model = load_model("mlx-community/Qwen2-Audio-7B-Instruct-4bit")

result = model.generate("audio.wav", prompt="Transcribe the audio.")
print(result.text)
```

Typical Prompts

  • Transcribe the audio.
  • Translate the speech to French.
  • What emotion is the speaker expressing?
  • Describe the environmental sounds in this clip.
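Because the task is selected entirely by the text prompt, several tasks can be run over the same clip in one loop using the `load_model`/`generate` API from the Usage section. A minimal sketch — the audio path is illustrative, and the model download is assumed to have completed:

```python
# Prompt per task, matching the typical prompts listed above.
PROMPTS = {
    "transcription": "Transcribe the audio.",
    "translation": "Translate the speech to French.",
    "emotion": "What emotion is the speaker expressing?",
    "captioning": "Describe the environmental sounds in this clip.",
}

if __name__ == "__main__":
    # Illustrative run: assumes audio.wav exists locally and the
    # 4-bit model weights are available.
    from mlx_audio.stt.utils import load_model

    model = load_model("mlx-community/Qwen2-Audio-7B-Instruct-4bit")
    for task, prompt in PROMPTS.items():
        result = model.generate("audio.wav", prompt=prompt)
        print(f"{task}: {result.text}")
```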

Capabilities

  • Speech transcription
  • Speech translation
  • Audio captioning
  • Emotion and sentiment analysis
  • Environmental sound classification
  • General audio understanding