MiMo-Audio  by XiaomiMiMo

Audio language models excel at few-shot learning and generalization

Created 3 weeks ago

New!

767 stars

Top 45.5% on SourcePulse

View on GitHub
Project Summary

MiMo-Audio addresses the limitations of task-specific fine-tuning in audio language models by enabling few-shot learning capabilities, mirroring advancements seen in text-based LLMs. It targets researchers and developers seeking to generalize audio processing tasks with minimal examples. The primary benefit is achieving state-of-the-art performance and novel generalization abilities across diverse audio tasks, including realistic audio continuation.

How It Works

MiMo-Audio scales next-token prediction pretraining on an extensive dataset (over 100 million hours) to foster few-shot learning in audio. Its architecture features a MiMo-Audio-Tokenizer, a 1.2B-parameter Transformer operating at 25 Hz using an RVQ stack, trained for reconstruction quality. This tokenizer is coupled with a patch encoder that aggregates audio tokens into patches for an LLM, and a patch decoder that autoregressively generates the full token sequence. This approach enhances modeling efficiency for high-rate sequences and bridges length mismatches, enabling emergent few-shot capabilities.
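The benefit of patch aggregation can be sketched with simple arithmetic: grouping consecutive 25 Hz tokens into patches shortens the sequence the LLM must model. The patch size below is an illustrative assumption, not a value from the MiMo-Audio paper.

```python
# Sketch of how patch aggregation shortens the LLM's input sequence.
# patch_size=4 is a hypothetical value chosen for illustration only.

TOKEN_RATE_HZ = 25  # the MiMo-Audio-Tokenizer emits tokens at 25 Hz
PATCH_SIZE = 4      # assumption: tokens grouped per patch by the patch encoder


def llm_sequence_length(audio_seconds: float, patch_size: int = PATCH_SIZE) -> int:
    """Number of patch embeddings the LLM processes for a clip."""
    n_tokens = int(audio_seconds * TOKEN_RATE_HZ)
    # The patch encoder aggregates consecutive tokens into one patch,
    # cutting the effective rate from 25 Hz to 25 / patch_size Hz;
    # the patch decoder later expands patches back to the full token sequence.
    return -(-n_tokens // patch_size)  # ceiling division


# A 60 s clip: 1500 raw tokens become 375 patches at patch_size=4.
print(llm_sequence_length(60.0))  # 375
```

Under these assumptions a one-minute clip shrinks from 1500 token positions to 375 patch positions, which is the "modeling efficiency for high-rate sequences" the design targets.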

Quick Start & Requirements

  • Installation: Clone the repository (git clone https://github.com/XiaomiMiMo/MiMo-Audio.git), navigate into the directory, install requirements (pip install -r requirements.txt), and install flash-attn (pip install flash-attn==2.7.4.post1). Models can be downloaded using hf download.
  • Prerequisites: Python 3.12, CUDA >= 12.0, and Linux are required.
  • Demo: A Gradio demo can be launched via python run_mimo_audio.py.
  • Resources: Links to the official blog, paper, and Hugging Face models are available.
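The installation steps above can be collected into one shell session. The model repo ID passed to `hf download` is an assumption based on the organization and model names mentioned here; check the README's Hugging Face links for the exact identifiers.

```shell
# Clone and enter the repository
git clone https://github.com/XiaomiMiMo/MiMo-Audio.git
cd MiMo-Audio

# Install dependencies (requires Python 3.12, CUDA >= 12.0, Linux)
pip install -r requirements.txt
pip install flash-attn==2.7.4.post1

# Download model weights (repo ID is assumed; see the README's
# Hugging Face links for the exact name)
hf download XiaomiMiMo/MiMo-Audio-7B-Base

# Launch the Gradio demo
python run_mimo_audio.py
```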

Highlighted Details

  • MiMo-Audio-7B-Base achieves state-of-the-art performance on speech intelligence and audio understanding benchmarks among open-source models.
  • Demonstrates generalization to tasks not present in its training data, such as voice conversion and style transfer.
  • Exhibits strong speech continuation capabilities, generating realistic talk shows and debates.
  • MiMo-Audio-7B-Instruct approaches or surpasses closed-source models on spoken dialogue and instruct-TTS evaluations.

Maintenance & Community

The project is associated with the "LLM-Core-Team Xiaomi". Direct community channels like Discord or Slack are not specified, but contact is available via mimo@xiaomi.com or by opening GitHub issues.

Licensing & Compatibility

The README does not state an open-source license for MiMo-Audio, so licensing terms should be verified before commercial or closed-source integration.

Limitations & Caveats

The README does not detail specific limitations. The project does impose firm prerequisites (Python 3.12, CUDA >= 12.0, Linux), and the 2025 citation date indicates a recent, likely still-evolving project. While capable of few-shot learning, its performance on highly specialized tasks may still trail dedicated fine-tuned models.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 39

Star History

773 stars in the last 26 days

Explore Similar Projects

Starred by Tim J. Baek (Founder of Open WebUI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 4 more.

StyleTTS2 by yl4579

Text-to-speech model achieving human-level synthesis

Top 0.1% on SourcePulse · 6k stars
Created 2 years ago · Updated 1 year ago