Audio foundation model for understanding, generation, and conversation
Kimi-Audio is an open-source universal audio foundation model designed for audio understanding, generation, and conversation. It targets researchers and developers working with multimodal AI, offering a unified framework for tasks like speech recognition, audio captioning, and end-to-end speech conversation, with claimed state-of-the-art performance.
How It Works
Kimi-Audio employs a hybrid audio input strategy, converting audio into discrete semantic tokens and continuous acoustic features. These are processed by a transformer-based LLM core (initialized from Qwen 2.5 7B) with parallel heads for text and audio token generation. Audio output is synthesized via a chunk-wise streaming detokenizer using flow matching and a vocoder, enabling low-latency generation. This approach allows a single model to handle diverse audio tasks efficiently.
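To make the "parallel heads" idea concrete, here is a minimal conceptual sketch: a shared transformer backbone whose hidden states feed two separate projections, one over a text vocabulary and one over a discrete audio-token vocabulary. All dimensions, vocabulary sizes, and class names are hypothetical and do not mirror Kimi-Audio's actual implementation.

```python
# Conceptual sketch only: shared hidden states feeding parallel text/audio heads.
# Sizes and class names are illustrative, not Kimi-Audio's real architecture.
import torch
import torch.nn as nn

class DualHeadDecoder(nn.Module):
    def __init__(self, hidden_dim=1024, text_vocab=32000, audio_vocab=8192):
        super().__init__()
        # Stand-in for the LLM core (Qwen 2.5 7B in the real model).
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Parallel output heads: one predicts text tokens, the other audio tokens.
        self.text_head = nn.Linear(hidden_dim, text_vocab)
        self.audio_head = nn.Linear(hidden_dim, audio_vocab)

    def forward(self, x):
        h = self.backbone(x)  # shared hidden states over fused audio/text inputs
        return self.text_head(h), self.audio_head(h)

# Example: one forward pass over a batch of 16 fused input embeddings.
logits_text, logits_audio = DualHeadDecoder()(torch.randn(1, 16, 1024))
print(logits_text.shape, logits_audio.shape)  # (1, 16, 32000) (1, 16, 8192)
```

In the full system, the audio-token stream would then be passed to the chunk-wise streaming detokenizer (flow matching plus a vocoder) to synthesize waveforms with low latency.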
Quick Start & Requirements
Clone the repository, then install the dependencies:

pip install -r requirements.txt
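After installation, inference follows a load-model-then-generate pattern. The sketch below is a hypothetical usage example: the module path `kimia_infer.api.kimia`, the `KimiAudio` class, the Hugging Face repo id, and the message/argument names are assumptions that should be checked against the repository's README.

```python
# Hypothetical inference sketch; verify class names, module paths, and message
# fields against the repository's README before use.
from kimia_infer.api.kimia import KimiAudio  # assumed module path

model = KimiAudio(
    model_path="moonshotai/Kimi-Audio-7B-Instruct",  # assumed Hugging Face repo id
    load_detokenizer=True,                           # only needed for audio output
)

# Ask the model to transcribe a local audio file (ASR-style prompt).
messages = [
    {"role": "user", "message_type": "text",
     "content": "Please transcribe the following audio:"},
    {"role": "user", "message_type": "audio", "content": "example.wav"},
]

_, text = model.generate(messages, output_type="text")
print(text)
```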
Maintenance & Community
The project is developed by MoonshotAI. Links to Hugging Face model repositories and an evaluation toolkit are provided. Contact is via GitHub issues.
Licensing & Compatibility
Code derived from Qwen2.5-7B is licensed under Apache 2.0. Other code is MIT licensed. This dual licensing generally permits commercial use and linking with closed-source applications.
Limitations & Caveats
The Generation Testset is primarily in Chinese, which may limit its direct applicability for English-only use cases. The model's performance on non-Chinese audio tasks is not explicitly detailed.