Audio language models excel at few-shot learning and generalization
MiMo-Audio addresses the limitations of task-specific fine-tuning in audio language models by enabling few-shot learning capabilities, mirroring advancements seen in text-based LLMs. It targets researchers and developers seeking to generalize audio processing tasks with minimal examples. The primary benefit is achieving state-of-the-art performance and novel generalization abilities across diverse audio tasks, including realistic audio continuation.
How It Works
MiMo-Audio scales next-token prediction pretraining on an extensive dataset (over 100 million hours) to foster few-shot learning in audio. Its architecture features MiMo-Audio-Tokenizer, a 1.2B-parameter Transformer operating at 25 Hz with an RVQ stack and trained for reconstruction quality. The tokenizer is coupled with a patch encoder that aggregates audio tokens into patches for an LLM, and a patch decoder that autoregressively generates the full token sequence. This design improves modeling efficiency for high-rate token sequences and bridges the length mismatch between audio and text, enabling emergent few-shot capabilities.
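To make the patch mechanism concrete, the sketch below groups 25 Hz RVQ code frames into fixed-size patches and projects each patch to a single embedding for the LLM. It is a minimal illustration under stated assumptions, not the project's implementation: the RVQ depth, patch size, codebook size, and hidden dimension are placeholder values; only the 25 Hz frame rate comes from the description above. A matching patch decoder would invert this step, autoregressively expanding each patch embedding back into its constituent RVQ tokens.

```python
# Minimal sketch of the patchification idea (not the official implementation).
# All sizes below except the 25 Hz frame rate are illustrative assumptions.
import torch
import torch.nn as nn

FRAME_RATE_HZ = 25      # tokenizer frame rate stated in the description
RVQ_DEPTH = 8           # assumed number of RVQ codebooks
PATCH_SIZE = 4          # assumed frames aggregated per patch
VOCAB = 1024            # assumed codebook size
D_MODEL = 512           # assumed LLM hidden size

class PatchEncoder(nn.Module):
    """Aggregates PATCH_SIZE frames of RVQ codes into one patch embedding."""
    def __init__(self):
        super().__init__()
        self.code_emb = nn.Embedding(VOCAB, D_MODEL)
        self.proj = nn.Linear(PATCH_SIZE * RVQ_DEPTH * D_MODEL, D_MODEL)

    def forward(self, codes: torch.Tensor) -> torch.Tensor:
        # codes: (batch, frames, RVQ_DEPTH) integer code indices
        b, t, _ = codes.shape
        assert t % PATCH_SIZE == 0, "pad the sequence to a multiple of PATCH_SIZE"
        x = self.code_emb(codes)            # (b, t, RVQ_DEPTH, D_MODEL)
        x = x.view(b, t // PATCH_SIZE, -1)  # flatten each patch of frames
        return self.proj(x)                 # (b, t / PATCH_SIZE, D_MODEL)

# A 10-second clip at 25 Hz yields 250 frames; with PATCH_SIZE = 4 the LLM
# sees roughly 63 patch embeddings instead of 250 * RVQ_DEPTH raw tokens.
codes = torch.randint(0, VOCAB, (1, 248, RVQ_DEPTH))  # 248 frames = 62 patches
patches = PatchEncoder()(codes)
print(patches.shape)  # torch.Size([1, 62, 512])
```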
Quick Start & Requirements
Clone the repository (git clone https://github.com/XiaomiMiMo/MiMo-Audio.git), navigate into the directory, install the requirements (pip install -r requirements.txt), and install flash-attn (pip install flash-attn==2.7.4.post1). Models can be downloaded using hf download; to run the model, execute python run_mimo_audio.py.
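As an alternative to the hf download CLI, weights can be fetched programmatically with huggingface_hub. This is a minimal sketch; the repository ids below are assumptions and should be verified against the project's model cards.

```python
# Sketch of fetching weights with huggingface_hub instead of the `hf download` CLI.
# The repo_id values are assumptions; check the exact names on Hugging Face.
from huggingface_hub import snapshot_download

for repo_id in ("XiaomiMiMo/MiMo-Audio-Tokenizer",     # assumed tokenizer repo
                "XiaomiMiMo/MiMo-Audio-7B-Instruct"):  # assumed instruct model repo
    local_dir = snapshot_download(repo_id=repo_id)     # downloads into the HF cache
    print(f"{repo_id} -> {local_dir}")
```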
Highlighted Details
Maintenance & Community
The project is associated with the "LLM-Core-Team Xiaomi". Direct community channels such as Discord or Slack are not specified, but contact is available via mimo@xiaomi.com or by opening GitHub issues.
Licensing & Compatibility
The specific open-source license for MiMo-Audio is not explicitly stated in the provided README, which may require further investigation for commercial or closed-source integration.
Limitations & Caveats
The README does not detail specific limitations. The project does, however, rely on specific hardware and software prerequisites (Python 3.12, CUDA 12.0), and the 2025 citation date indicates a recent and likely still-evolving project. While capable of few-shot learning, its performance on highly specialized tasks may differ from that of dedicated fine-tuned models.