LongCat-Audio-Codec by meituan-longcat

Advanced audio tokenization and detokenization for Speech LLMs

Created 2 months ago
268 stars

Top 95.9% on SourcePulse

Project Summary

This project provides an audio tokenizer and detokenizer designed for speech large language models (LLMs). It reconstructs audio with high fidelity at extremely low bitrates, and it includes a low-latency streaming detokenizer and integrated audio super-resolution. Together these make it suited to efficient, high-quality speech processing in LLM backends.

How It Works

The tokenizer generates semantic and acoustic tokens in parallel at a low frame rate of 16.6 Hz. Producing both token types keeps reconstructed speech highly intelligible even at ultra-low bitrates. The detokenizer supports streaming to minimize latency, and it incorporates audio super-resolution so that the output can have a higher sample rate than the input.
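As a rough illustration of what a 16.6 Hz frame rate implies, the sketch below estimates how many token frames a clip produces and what bitrate follows if every frame emits one index per codebook. Only the frame rate comes from the summary above. The codebook count and codebook size passed in the usage note are hypothetical placeholders, and the rounding rule is a guess rather than the codec's actual behavior.

```python
import math

FRAME_RATE_HZ = 16.6  # frame rate quoted in the project summary


def frames_for_duration(seconds: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Approximate number of token frames for a clip of the given length.

    Rounding to the nearest frame is an assumption; the real codec may
    pad or truncate differently.
    """
    return round(seconds * frame_rate_hz)


def bitrate_bps(n_codebooks: int, codebook_size: int,
                frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Bits per second if each frame emits one index per codebook."""
    return frame_rate_hz * n_codebooks * math.log2(codebook_size)
```

For example, frames_for_duration(30) gives 498 frames for a 30-second clip, and bitrate_bps(2, 1024) gives about 332 bits per second under the assumed values of 2 codebooks with 1024 entries each.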

Quick Start & Requirements

  • Installation: Create and activate a conda environment with python=3.10, install the PyTorch build that matches your hardware (e.g., pip install torch==2.7.1 torchaudio==2.7.1), then install the remaining dependencies with pip install -r requirements.txt.
  • Model Preparation: Download the model checkpoints (encoder, CMVN, decoders) from Hugging Face. Place them in the LongCat-Audio-Codec/ckpts/ directory, or update ckpt_path in the corresponding .yaml configuration files.
  • Run Demo: Run bash ./run_inference.sh from the project root for a one-click demonstration; outputs are saved in demo_audio_output/. To customize the run, modify the script or call inference.py directly with arguments.
  • Prerequisites: Python 3.10, PyTorch (hardware-specific version), and listed dependencies.
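Before running the demo, the setup steps above can be sanity-checked with a small script. This is a minimal sketch and not part of the repository: the function name preflight is hypothetical, and the check for at least one .pt file assumes the checkpoints use that extension, which the summary confirms only for one decoder file.

```python
import importlib.util
import sys
from pathlib import Path


def preflight(ckpt_dir: str, required_python: tuple = (3, 10)) -> list:
    """Return a list of human-readable setup problems (empty if none found)."""
    problems = []

    # The summary lists Python 3.10 as a prerequisite.
    found = sys.version_info[:2]
    if found != required_python:
        problems.append(
            f"expected Python {required_python[0]}.{required_python[1]}, "
            f"found {found[0]}.{found[1]}"
        )

    # find_spec locates the package without importing it.
    if importlib.util.find_spec("torch") is None:
        problems.append("PyTorch is not installed")

    # Assumption: checkpoints are stored as .pt files directly in ckpt_dir.
    path = Path(ckpt_dir)
    if not path.is_dir():
        problems.append(f"checkpoint directory not found: {path}")
    elif not any(path.glob("*.pt")):
        problems.append(f"no .pt files in {path}")

    return problems
```

Calling preflight("ckpts") from the project root returns an empty list when nothing looks wrong, or a list of messages to act on before running run_inference.sh.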

Highlighted Details

  • Achieves high fidelity and intelligibility at ultra-low bitrates.
  • Employs a low-frame-rate tokenizer (16.6 Hz) generating semantic and acoustic tokens in parallel.
  • Features a low-latency streaming detokenizer.
  • Integrates audio super-resolution for generating higher-sample-rate audio.

Maintenance & Community

The project page was released on Oct 17, 2025, and the arXiv page on Oct 20, 2025. The team can be contacted at longcat-team@meituan.com or through a WeChat group.

Licensing & Compatibility

The code and models are released under the MIT License, which grants broad permissions for use, modification, and distribution, including commercial use and linking with closed-source projects. Users are accountable for their usage, which must comply with applicable laws and avoid harmful content.

Limitations & Caveats

The current version supports only single-channel speech and requires input audio shorter than 30 seconds; longer audio must be segmented manually. The LongCatAudioCodec_decoder_24k_2codebooks.pt model was fine-tuned on a limited speaker dataset, so reconstruction quality may degrade for speakers not present in the training set.
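Manual segmentation for the 30-second limit could look like the sketch below, which computes sample ranges that each stay under the limit. The helper name split_into_segments and the 29-second default are illustrative assumptions, not part of the project. Fixed-length cuts can also land mid-word, so a real pipeline might prefer to cut at silences.

```python
def split_into_segments(num_samples: int, sample_rate: int,
                        max_seconds: float = 29.0) -> list:
    """Return (start, end) sample ranges, each at most max_seconds long.

    The default of 29 s is a hypothetical safety margin below the
    documented 30 s input limit.
    """
    if num_samples < 0 or sample_rate <= 0 or max_seconds <= 0:
        raise ValueError("num_samples must be >= 0; rate and duration must be > 0")

    max_len = int(max_seconds * sample_rate)  # segment length in samples
    if max_len < 1:
        raise ValueError("max_seconds is too short for this sample rate")

    # Step through the signal in fixed-size hops; the last segment may be shorter.
    return [
        (start, min(start + max_len, num_samples))
        for start in range(0, num_samples, max_len)
    ]
```

With a 70-second mono clip at an assumed 16 kHz sample rate, split_into_segments(70 * 16000, 16000) yields three ranges: two of 29 seconds and a final one of 12 seconds. Each range can then be sliced from the waveform and passed through the codec separately.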

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 2
  • Issues (30d): 0
  • Star history: 7 stars in the last 30 days
