Kimi-Audio by MoonshotAI

Audio foundation model for understanding, generation, and conversation

Created 1 year ago
4,642 stars

Top 10.5% on SourcePulse

Project Summary

Kimi-Audio is an open-source universal audio foundation model designed for audio understanding, generation, and conversation. It targets researchers and developers working with multimodal AI, offering a unified framework for tasks like speech recognition, audio captioning, and end-to-end speech conversation, with claimed state-of-the-art performance.

How It Works

Kimi-Audio employs a hybrid audio input strategy, converting audio into discrete semantic tokens and continuous acoustic features. These are processed by a transformer-based LLM core (initialized from Qwen 2.5 7B) with parallel heads for text and audio token generation. Audio output is synthesized via a chunk-wise streaming detokenizer using flow matching and a vocoder, enabling low-latency generation. This approach allows a single model to handle diverse audio tasks efficiently.
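As a rough illustration of the hybrid input described above, the sketch below fuses one discrete semantic token and one continuous acoustic feature vector per audio frame into a single input vector for the LLM core. It assumes element-wise addition as the fusion step, which this summary does not specify. The names `EMBEDDING_TABLE`, `fuse_frame`, and `fuse_sequence`, along with all sizes and values, are hypothetical and are not taken from the Kimi-Audio codebase.

```python
from typing import List, Sequence

EMBED_DIM = 4  # hypothetical hidden width, kept tiny for readability

# Hypothetical embedding table for a 3-entry semantic codebook.
# A real model would learn these vectors during training.
EMBEDDING_TABLE: List[List[float]] = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]


def fuse_frame(token_id: int, acoustic: Sequence[float]) -> List[float]:
    """Combine one discrete token with one continuous feature vector.

    Assumption: fusion is element-wise addition of the token embedding
    and the acoustic features, both of width EMBED_DIM.
    """
    if len(acoustic) != EMBED_DIM:
        raise ValueError("acoustic feature width must equal EMBED_DIM")
    embedding = EMBEDDING_TABLE[token_id]
    return [e + a for e, a in zip(embedding, acoustic)]


def fuse_sequence(
    token_ids: Sequence[int],
    acoustic_frames: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Fuse two frame-aligned streams into one sequence of input vectors."""
    if len(token_ids) != len(acoustic_frames):
        raise ValueError("both streams must have the same number of frames")
    return [fuse_frame(t, a) for t, a in zip(token_ids, acoustic_frames)]


# Two frames of toy input: token ids plus matching acoustic features.
semantic_tokens = [2, 0]
acoustic_features = [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]]
fused = fuse_sequence(semantic_tokens, acoustic_features)
```

The point of the sketch is only that both streams must be frame-aligned and share a width before they can be combined. The real tokenizer, feature encoder, and fusion layer are considerably more involved.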

Quick Start & Requirements

  • Clone the repository, then install dependencies with pip install -r requirements.txt.
  • Requires PyTorch, Transformers, and other standard ML libraries.
  • Pretrained model weights are available on Hugging Face.
  • Official quick-start examples and evaluation toolkit are provided.

Highlighted Details

  • Reports state-of-the-art (SOTA) results on numerous audio benchmarks covering ASR, audio understanding, and audio-to-text chat.
  • Pre-trained on over 13 million hours of diverse audio data, together with text data.
  • Features a novel hybrid audio input and a streaming detokenizer for efficient inference.
  • Released with code, model checkpoints, and a comprehensive evaluation toolkit for reproducibility.
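To make the streaming detokenizer in the list above more concrete, here is a minimal sketch of chunk-wise decoding: audio is emitted as soon as a fixed number of semantic tokens has arrived, instead of waiting for the full sequence. The functions `toy_decode`, `stream_detokenize`, and `generate_tokens`, and the constants `CHUNK_TOKENS` and `SAMPLES_PER_TOKEN`, are hypothetical stand-ins. In particular, `toy_decode` only mimics the shape of the output and does not perform flow matching or vocoding.

```python
from typing import Callable, Iterable, Iterator, List, Sequence

# Hypothetical constants, chosen only for illustration.
CHUNK_TOKENS = 4        # semantic tokens decoded per chunk
SAMPLES_PER_TOKEN = 3   # waveform samples produced per token (toy value)


def toy_decode(tokens: Sequence[int]) -> List[float]:
    """Stand-in for the flow-matching model plus vocoder.

    Emits SAMPLES_PER_TOKEN identical samples per token so that output
    lengths are easy to verify. A real detokenizer synthesizes audio here.
    """
    samples: List[float] = []
    for tok in tokens:
        samples.extend([tok / 100.0] * SAMPLES_PER_TOKEN)
    return samples


def stream_detokenize(
    token_stream: Iterable[int],
    decode: Callable[[Sequence[int]], List[float]] = toy_decode,
    chunk_tokens: int = CHUNK_TOKENS,
) -> Iterator[List[float]]:
    """Yield one waveform chunk each time chunk_tokens tokens have arrived."""
    buffer: List[int] = []
    for tok in token_stream:
        buffer.append(tok)
        if len(buffer) == chunk_tokens:
            yield decode(buffer)
            buffer = []
    if buffer:  # flush the final, possibly shorter, chunk
        yield decode(buffer)


def generate_tokens(n: int) -> Iterator[int]:
    """Stand-in for the model's audio head emitting tokens one at a time."""
    for i in range(n):
        yield i


# Ten tokens arrive one by one; audio is produced in three chunks.
chunks = list(stream_detokenize(generate_tokens(10)))
chunk_lengths = [len(c) for c in chunks]
```

Because `stream_detokenize` is a generator, a caller can start playback after the first chunk, which is where the latency benefit comes from. A smaller `chunk_tokens` lowers the delay before the first audio but gives the decoder less context per call.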

Maintenance & Community

The project is developed by MoonshotAI. Links to Hugging Face model repositories and an evaluation toolkit are provided. Contact is via GitHub issues.

Licensing & Compatibility

Code derived from Qwen2.5-7B is licensed under Apache 2.0. Other code is MIT licensed. This dual licensing generally permits commercial use and linking with closed-source applications.

Limitations & Caveats

The Generation Testset is primarily in Chinese, which may limit its direct applicability for English-only use cases. The model's performance on non-Chinese audio tasks is not explicitly detailed.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 62 stars in the last 30 days
