Kimi-Audio by MoonshotAI

Audio foundation model for understanding, generation, and conversation

Created 4 months ago · 4,237 stars · Top 11.6% on SourcePulse

Project Summary

Kimi-Audio is an open-source universal audio foundation model designed for audio understanding, generation, and conversation. It targets researchers and developers working with multimodal AI, offering a unified framework for tasks like speech recognition, audio captioning, and end-to-end speech conversation, with claimed state-of-the-art performance.

How It Works

Kimi-Audio employs a hybrid audio input strategy, converting audio into both discrete semantic tokens and continuous acoustic features. These are processed by a transformer-based LLM core (initialized from Qwen2.5-7B) with parallel heads for text and audio token generation. Audio output is synthesized by a chunk-wise streaming detokenizer that uses flow matching and a vocoder, enabling low-latency generation. This design lets a single model handle diverse audio tasks efficiently.
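
As a rough illustration of this flow, the sketch below fuses discrete semantic tokens with continuous acoustic features and decodes through parallel text/audio heads. All dimensions, vocabulary sizes, and module names are assumptions for illustration, not Kimi-Audio's actual implementation.

    import torch
    import torch.nn as nn

    # Illustrative sketch only: the sizes, the 80-dim acoustic features, and
    # fusion-by-summation are assumptions, not Kimi-Audio's real architecture.
    class HybridAudioLMSketch(nn.Module):
        def __init__(self, semantic_vocab=16384, audio_vocab=16384,
                     text_vocab=32000, d_model=512, n_layers=4, n_heads=8):
            super().__init__()
            self.semantic_embed = nn.Embedding(semantic_vocab, d_model)  # discrete semantic tokens
            self.acoustic_proj = nn.Linear(80, d_model)  # continuous features (e.g. 80-dim mels)
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.core = nn.TransformerEncoder(layer, n_layers)  # stand-in for the LLM core
            self.text_head = nn.Linear(d_model, text_vocab)    # parallel head: text tokens
            self.audio_head = nn.Linear(d_model, audio_vocab)  # parallel head: audio tokens

        def forward(self, semantic_ids, acoustic_feats):
            # Hybrid input: fuse discrete-token embeddings with projected
            # continuous acoustic features, frame by frame.
            h = self.semantic_embed(semantic_ids) + self.acoustic_proj(acoustic_feats)
            h = self.core(h)
            # The two parallel heads predict text and audio tokens from shared states.
            return self.text_head(h), self.audio_head(h)

    model = HybridAudioLMSketch()
    sem = torch.randint(0, 16384, (1, 50))      # 50 frames of semantic token ids
    ac = torch.randn(1, 50, 80)                 # matching continuous features
    text_logits, audio_logits = model(sem, ac)  # parallel text/audio predictions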

Quick Start & Requirements

  • Install via pip install -r requirements.txt after cloning the repository.
  • Requires PyTorch, Transformers, and other standard ML libraries.
  • Pretrained model weights are available on Hugging Face.
  • Official quick-start examples and an evaluation toolkit are provided; a hedged usage sketch follows this list.
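
The sketch below strings the bullets together. The Hugging Face repo id and the KimiAudio import path and constructor arguments are assumptions drawn from the project's published examples, so verify them against the repository README before use.

    # Hedged quick-start sketch; verify names against the repository README.
    # Shell setup, per the bullets above:
    #   git clone https://github.com/MoonshotAI/Kimi-Audio.git
    #   cd Kimi-Audio
    #   pip install -r requirements.txt

    from huggingface_hub import snapshot_download

    # Fetch pretrained weights from Hugging Face (repo id assumed).
    local_dir = snapshot_download("moonshotai/Kimi-Audio-7B-Instruct")

    # Illustrative inference entry point; class name and arguments are
    # assumptions based on the project's quick-start examples.
    from kimia_infer.api.kimia import KimiAudio

    model = KimiAudio(model_path=local_dir, load_detokenizer=True)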

Highlighted Details

  • Reports state-of-the-art performance on numerous audio benchmarks, including ASR, audio understanding, and audio-to-text chat.
  • Pre-trained on over 13 million hours of diverse audio and text data.
  • Features a novel hybrid audio input and a streaming detokenizer for efficient inference (see the sketch after this list).
  • Released with code, model checkpoints, and a comprehensive evaluation toolkit for reproducibility.
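
The detokenizer's chunk-wise behavior can be sketched in a few lines: audio tokens are converted to waveform in fixed-size chunks, so playback can begin before the full sequence is generated. The chunk size, the samples-per-token ratio, and the stubbed detokenize_chunk function are illustrative assumptions; the real detokenizer uses flow matching plus a vocoder.

    import torch

    def detokenize_chunk(tokens: torch.Tensor) -> torch.Tensor:
        # Stand-in for flow matching + vocoder; the real model maps audio
        # tokens to waveform. 320 samples per token is an assumed ratio.
        return torch.randn(tokens.numel() * 320)

    def stream_audio(token_stream, chunk_size: int = 30):
        buffer = []
        for tok in token_stream:            # tokens arrive as the LLM emits them
            buffer.append(tok)
            if len(buffer) == chunk_size:   # synthesize as soon as a chunk fills
                yield detokenize_chunk(torch.tensor(buffer))
                buffer = []
        if buffer:                          # flush the final partial chunk
            yield detokenize_chunk(torch.tensor(buffer))

    # Each chunk can be sent to an audio sink immediately, which is what
    # gives the low-latency behavior described above.
    for wav_chunk in stream_audio(range(100)):
        print(wav_chunk.shape)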

Maintenance & Community

The project is developed by MoonshotAI. Links to Hugging Face model repositories and an evaluation toolkit are provided. Contact is via GitHub issues.

Licensing & Compatibility

Code derived from Qwen2.5-7B is licensed under Apache 2.0. Other code is MIT licensed. This dual licensing generally permits commercial use and linking with closed-source applications.

Limitations & Caveats

The Generation Testset is primarily in Chinese, which may limit its direct applicability for English-only use cases. The model's performance on non-Chinese audio tasks is not explicitly detailed.

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 6
  • Star History: 80 stars in the last 30 days

Explore Similar Projects

Qwen-Audio by QwenLM

Audio-language model for audio understanding and chat
Starred by Junyang Lin (Core Maintainer at Alibaba Qwen), Jinze Bai (Research Scientist at Alibaba Qwen), and 1 more.
Top 0.4% on SourcePulse · 2k stars
Created 1 year ago · Updated 1 year ago