MiMo-Audio  by XiaomiMiMo

Audio language models excel at few-shot learning and generalization

Created 3 weeks ago

New!

767 stars

Top 45.5% on SourcePulse

View on GitHub
Project Summary

MiMo-Audio addresses the limitations of task-specific fine-tuning in audio language models by enabling few-shot learning capabilities, mirroring advancements seen in text-based LLMs. It targets researchers and developers seeking to generalize audio processing tasks with minimal examples. The primary benefit is achieving state-of-the-art performance and novel generalization abilities across diverse audio tasks, including realistic audio continuation.

How It Works

MiMo-Audio scales next-token prediction pretraining on an extensive dataset (over 100 million hours) to foster few-shot learning in audio. Its architecture features a MiMo-Audio-Tokenizer, a 1.2B-parameter Transformer operating at 25 Hz using an RVQ stack, trained for reconstruction quality. This tokenizer is coupled with a patch encoder that aggregates audio tokens into patches for an LLM, and a patch decoder that autoregressively generates the full token sequence. This approach enhances modeling efficiency for high-rate sequences and bridges length mismatches, enabling emergent few-shot capabilities.
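The benefit of patch aggregation can be sketched with simple arithmetic: grouping consecutive 25 Hz tokens into patches shortens the sequence the LLM must model. The patch size below is an illustrative assumption, not a value from the MiMo-Audio paper.

```python
# Sketch of how patch aggregation shortens the LLM's input sequence.
# patch_size=4 is a hypothetical value chosen for illustration only.

TOKEN_RATE_HZ = 25  # the MiMo-Audio-Tokenizer emits tokens at 25 Hz
PATCH_SIZE = 4      # assumption: tokens grouped per patch by the patch encoder


def llm_sequence_length(audio_seconds: float, patch_size: int = PATCH_SIZE) -> int:
    """Number of patch embeddings the LLM processes for a clip."""
    n_tokens = int(audio_seconds * TOKEN_RATE_HZ)
    # The patch encoder aggregates consecutive tokens into one patch,
    # cutting the effective rate from 25 Hz to 25 / patch_size Hz;
    # the patch decoder later expands patches back to the full token sequence.
    return -(-n_tokens // patch_size)  # ceiling division


# A 60 s clip: 1500 raw tokens become 375 patches at patch_size=4.
print(llm_sequence_length(60.0))  # 375
```

Under these assumptions a one-minute clip shrinks from 1500 token positions to 375 patch positions, which is the "modeling efficiency for high-rate sequences" the design targets.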

Quick Start & Requirements

  • Installation: Clone the repository (git clone https://github.com/XiaomiMiMo/MiMo-Audio.git), navigate into the directory, install requirements (pip install -r requirements.txt), and install flash-attn (pip install flash-attn==2.7.4.post1). Models can be downloaded using hf download.
  • Prerequisites: Python 3.12, CUDA >= 12.0, and Linux are required.
  • Demo: A Gradio demo can be launched via python run_mimo_audio.py.
  • Resources: Links to the official blog, paper, and Hugging Face models are available.
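The installation steps above can be collected into one shell session. The model repo ID passed to `hf download` is an assumption based on the organization and model names mentioned here; check the README's Hugging Face links for the exact identifiers.

```shell
# Clone and enter the repository
git clone https://github.com/XiaomiMiMo/MiMo-Audio.git
cd MiMo-Audio

# Install dependencies (requires Python 3.12, CUDA >= 12.0, Linux)
pip install -r requirements.txt
pip install flash-attn==2.7.4.post1

# Download model weights (repo ID is assumed; see the README's
# Hugging Face links for the exact name)
hf download XiaomiMiMo/MiMo-Audio-7B-Base

# Launch the Gradio demo
python run_mimo_audio.py
```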

Highlighted Details

  • MiMo-Audio-7B-Base achieves state-of-the-art performance on speech intelligence and audio understanding benchmarks among open-source models.
  • Demonstrates generalization to tasks not present in its training data, such as voice conversion and style transfer.
  • Exhibits strong speech continuation capabilities, generating realistic talk shows and debates.
  • MiMo-Audio-7B-Instruct approaches or surpasses closed-source models on spoken dialogue and instruct-TTS evaluations.

Maintenance & Community

The project is associated with the "LLM-Core-Team Xiaomi". Direct community channels like Discord or Slack are not specified, but contact is available via mimo@xiaomi.com or by opening GitHub issues.

Licensing & Compatibility

The README does not state an open-source license for MiMo-Audio, so licensing terms should be verified before commercial or closed-source integration.

Limitations & Caveats

The README does not detail specific limitations. The project does impose firm prerequisites (Python 3.12, CUDA >= 12.0, Linux), and the 2025 citation date indicates a recent, likely still-evolving project. While capable of few-shot learning, its performance on highly specialized tasks may still trail dedicated fine-tuned models.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 39

Star History

773 stars in the last 26 days

Explore Similar Projects

Starred by Tim J. Baek (Founder of Open WebUI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 4 more.

StyleTTS2 by yl4579

Text-to-speech model achieving human-level synthesis

Top 0.1% on SourcePulse · 6k stars
Created 2 years ago · Updated 1 year ago