Kimi-Audio by MoonshotAI

Audio foundation model for understanding, generation, and conversation

Created 8 months ago

4,441 stars

Top 11.0% on SourcePulse


2 Experts Love This Project

Luis Capelo

Cofounder of Lightning AI

Benjamin Bolte

Cofounder of K-Scale Labs

Project Summary

Kimi-Audio is an open-source universal audio foundation model designed for audio understanding, generation, and conversation. It targets researchers and developers working with multimodal AI, offering a unified framework for tasks like speech recognition, audio captioning, and end-to-end speech conversation, with claimed state-of-the-art performance.

How It Works

Kimi-Audio uses a hybrid audio input strategy: incoming audio is converted into both discrete semantic tokens and continuous acoustic features. Both are processed by a transformer-based LLM core (initialized from Qwen2.5-7B) with parallel heads for text and audio token generation. Audio output is synthesized by a chunk-wise streaming detokenizer that uses flow matching followed by a vocoder, which enables low-latency generation. This design lets a single model handle diverse audio tasks.
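
To make the chunk-wise streaming idea concrete, here is a minimal Python sketch of how audio tokens might be grouped into fixed-size chunks and synthesized as they arrive, so playback can start before generation finishes. It is not Kimi-Audio's implementation: stream_detokenize, toy_synthesize, and chunk_size are hypothetical names, and the toy synthesizer stands in for the flow-matching and vocoder stages. The sketch also ignores any context sharing across chunk boundaries.

```python
from typing import Callable, Iterable, Iterator, List


def stream_detokenize(
    tokens: Iterable[int],
    synthesize: Callable[[List[int]], List[float]],
    chunk_size: int = 4,
) -> Iterator[List[float]]:
    """Yield synthesized audio for each chunk of tokens as soon as it is full."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    buffer: List[int] = []
    for token in tokens:
        buffer.append(token)
        if len(buffer) == chunk_size:
            # Emit audio for this chunk without waiting for later tokens.
            yield synthesize(buffer)
            buffer = []
    if buffer:
        # Flush the final, possibly shorter, chunk.
        yield synthesize(buffer)


def toy_synthesize(chunk: List[int]) -> List[float]:
    """Hypothetical stand-in for flow matching + vocoder: two samples per token."""
    samples: List[float] = []
    for token in chunk:
        samples.extend([token / 100.0, token / 100.0])
    return samples


# Ten tokens with chunk_size=4 produce chunks of 4, 4, and 2 tokens.
audio_chunks = list(stream_detokenize(range(10), toy_synthesize, chunk_size=4))
chunk_lengths = [len(chunk) for chunk in audio_chunks]
```

Because stream_detokenize is a generator, the caller receives the first audio chunk after only chunk_size tokens have been produced. That is the source of the latency benefit: smaller chunks shorten the wait for the first audio at the cost of more synthesis calls.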

Quick Start & Requirements

Clone the repository, then install dependencies with pip install -r requirements.txt.
Requires PyTorch, Transformers, and other standard ML libraries.
Pretrained model weights are available on Hugging Face.
Official quick-start examples and evaluation toolkit are provided.

Highlighted Details

Reports state-of-the-art (SOTA) results on numerous audio benchmarks, covering ASR, audio understanding, and audio-to-text chat.
Pre-trained on over 13 million hours of diverse audio and text data.
Features a novel hybrid audio input and a streaming detokenizer for efficient inference.
Released with code, model checkpoints, and a comprehensive evaluation toolkit for reproducibility.

Maintenance & Community

The project is developed by MoonshotAI. Links to Hugging Face model repositories and an evaluation toolkit are provided. Contact is via GitHub issues.

Licensing & Compatibility

Code derived from Qwen2.5-7B is licensed under Apache 2.0; the remaining code is MIT licensed. Both licenses are permissive and generally allow commercial use and linking with closed-source applications.

Limitations & Caveats

The Generation Testset is primarily in Chinese, which may limit its direct applicability for English-only use cases. The model's performance on non-Chinese audio tasks is not explicitly detailed.

Health Check

Last Commit

6 months ago

Responsiveness

1 day

Star History

72 stars in the last 30 days