Kimi-Audio by MoonshotAI

Audio foundation model for understanding, generation, and conversation

created 3 months ago
4,087 stars

Top 12.2% on sourcepulse

View on GitHub
Project Summary

Kimi-Audio is an open-source universal audio foundation model designed for audio understanding, generation, and conversation. It targets researchers and developers working with multimodal AI, offering a unified framework for tasks like speech recognition, audio captioning, and end-to-end speech conversation, with claimed state-of-the-art performance.

How It Works

Kimi-Audio employs a hybrid audio input strategy, converting audio into discrete semantic tokens and continuous acoustic features. These are processed by a transformer-based LLM core (initialized from Qwen2.5-7B) with parallel heads for text and audio token generation. Audio output is synthesized via a chunk-wise streaming detokenizer using flow matching and a vocoder, enabling low-latency generation. This approach allows a single model to handle diverse audio tasks efficiently.
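
Below is a minimal PyTorch sketch of that dual-path design. All dimensions, vocabulary sizes, and the additive fusion rule are illustrative assumptions; Kimi-Audio's actual implementation (including the Qwen2.5-7B core and the flow-matching detokenizer, which is not sketched here) differs.

```python
import torch
import torch.nn as nn

class HybridAudioLM(nn.Module):
    """Sketch of a hybrid audio-input LM: discrete semantic tokens plus
    continuous acoustic features feed a shared transformer core with
    parallel text and audio output heads. All sizes are illustrative."""

    def __init__(self, text_vocab=32000, audio_vocab=8192,
                 acoustic_dim=1280, d_model=512):
        super().__init__()
        # Discrete semantic tokens from an audio tokenizer.
        self.semantic_embed = nn.Embedding(audio_vocab, d_model)
        # Continuous acoustic features (e.g. from a speech encoder),
        # projected into the shared hidden space.
        self.acoustic_proj = nn.Linear(acoustic_dim, d_model)
        # Stand-in for the Qwen2.5-7B-initialized transformer core.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.core = nn.TransformerEncoder(layer, num_layers=2)
        # Parallel heads: one predicts text tokens, one predicts audio tokens.
        self.text_head = nn.Linear(d_model, text_vocab)
        self.audio_head = nn.Linear(d_model, audio_vocab)

    def forward(self, semantic_ids, acoustic_feats):
        # Fuse the discrete and continuous views of the audio by addition
        # (the fusion rule here is an assumption for illustration).
        h = self.semantic_embed(semantic_ids) + self.acoustic_proj(acoustic_feats)
        h = self.core(h)
        return self.text_head(h), self.audio_head(h)

# Toy usage: one batch of 25 audio frames.
model = HybridAudioLM()
ids = torch.randint(0, 8192, (1, 25))
feats = torch.randn(1, 25, 1280)
text_logits, audio_logits = model(ids, feats)
```

The audio-token logits would then drive the chunk-wise streaming detokenizer (flow matching plus a vocoder) to produce waveforms; that stage is omitted above.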

Quick Start & Requirements

  • Install via pip install -r requirements.txt after cloning the repository.
  • Requires PyTorch, Transformers, and other standard ML libraries.
  • Pretrained model weights are available on Hugging Face; a minimal download sketch follows this list.
  • Official quick-start examples and evaluation toolkit are provided.
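
A minimal checkpoint-download sketch using the huggingface_hub client. The repository id moonshotai/Kimi-Audio-7B-Instruct is an assumption, so confirm it against the project's Hugging Face links before running.

```python
# Download the pretrained weights from Hugging Face.
# NOTE: the repo id below is an assumption; check the official
# model links in the Kimi-Audio README for the exact id.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="moonshotai/Kimi-Audio-7B-Instruct")
print(f"Checkpoint downloaded to {local_dir}")
```

From there, the official quick-start examples show how to run inference against the downloaded checkpoint.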

Highlighted Details

  • Reports state-of-the-art (SOTA) results on numerous audio benchmarks, including ASR, audio understanding, and audio-to-text chat.
  • Pre-trained on over 13 million hours of diverse audio and text data.
  • Features a novel hybrid audio input and a streaming detokenizer for efficient inference.
  • Released with code, model checkpoints, and a comprehensive evaluation toolkit for reproducibility.

Maintenance & Community

The project is developed by MoonshotAI. Links to Hugging Face model repositories and an evaluation toolkit are provided. Contact is via GitHub issues.

Licensing & Compatibility

Code derived from Qwen2.5-7B is licensed under Apache 2.0. Other code is MIT licensed. This dual licensing generally permits commercial use and linking with closed-source applications.

Limitations & Caveats

The Generation Testset is primarily in Chinese, which may limit its direct applicability for English-only use cases. The model's performance on non-Chinese audio tasks is not explicitly detailed.

Health Check
  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 11
Star History
849 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems) and Jeff Hammerbacher (cofounder of Cloudera).

AudioGPT by AIGC-Audio

Audio processing and generation research project

created 2 years ago, updated 1 year ago
10k stars

Top 0.1% on sourcepulse