LongCat-Audio-Codec by meituan-longcat

Advanced audio tokenization and detokenization for Speech LLMs

Created 4 months ago

285 stars

Top 91.9% on SourcePulse

Project Summary

This project provides an audio tokenizer and detokenizer designed for speech large language models (LLMs). It reconstructs audio with high fidelity at very low bitrates and includes a low-latency streaming detokenizer and built-in audio super-resolution, which suits it to efficient, high-quality speech processing in LLM backends.

How It Works

The tokenizer generates semantic and acoustic tokens in parallel at a low frame rate of 16.6 Hz. Producing both token types keeps reconstructed speech highly intelligible even at ultra-low bitrates. The detokenizer supports streaming to minimize latency and applies audio super-resolution, so its output can have a higher sample rate than the input.
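To make the frame-rate figure concrete, the arithmetic below relates the 16.6 Hz rate to frame duration, frame count, and bitrate. It is a minimal sketch rather than the project's code: the function names are hypothetical, and the codebook sizes in the demo are illustrative assumptions because the text above does not state them.

```python
import math

FRAME_RATE_HZ = 16.6  # frame rate quoted in the text above


def frame_duration_ms(frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Length of audio covered by one token frame, in milliseconds."""
    return 1000.0 / frame_rate_hz


def approx_frames(seconds: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Approximate number of token frames for `seconds` of audio (nearest integer)."""
    return round(seconds * frame_rate_hz)


def bitrate_bps(codebook_sizes: list[int], frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Bits per second when every frame carries one index from each codebook."""
    bits_per_frame = sum(math.log2(size) for size in codebook_sizes)
    return bits_per_frame * frame_rate_hz


if __name__ == "__main__":
    # ASSUMPTION: one semantic and one acoustic codebook with 8192 entries each.
    # These sizes are placeholders for illustration, not values from the project.
    assumed_codebooks = [8192, 8192]
    print(f"frame duration: {frame_duration_ms():.1f} ms")
    print(f"frames in 30 s: {approx_frames(30.0)}")
    print(f"bitrate: {bitrate_bps(assumed_codebooks):.1f} bps")
```

The bitrate grows linearly with the frame rate and with the number of bits per frame, which is why a low frame rate helps keep the bitrate down.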

Quick Start & Requirements

Installation: Create a conda environment (python=3.10), activate it, and install PyTorch matching your hardware configuration (e.g., pip install torch==2.7.1 torchaudio==2.7.1), followed by other dependencies (pip install -r requirements.txt).
Model Preparation: Download the model checkpoints (encoder, CMVN, decoders) from Hugging Face. Place them in the LongCat-Audio-Codec/ckpts/ directory, or update ckpt_path in the corresponding .yaml configuration files to point at their location.
Run Demo: Execute bash ./run_inference.sh from the project root for a one-click demonstration, with outputs saved in demo_audio_output/. Customization is possible by modifying the script or running inference.py directly with arguments.
Prerequisites: Python 3.10, PyTorch (hardware-specific version), and listed dependencies.
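Before running the demo, it can help to confirm that the setup steps above took effect. The following is a small pre-flight sketch, not part of the project: the helper names are hypothetical, and it only lists whatever files are present in the checkpoint directory instead of assuming specific checkpoint file names.

```python
import sys
from pathlib import Path


def python_version_ok(version_info=sys.version_info) -> bool:
    """True when the interpreter is Python 3.10, the version named in the setup steps."""
    return tuple(version_info[:2]) == (3, 10)


def list_checkpoint_files(ckpt_dir: Path) -> list[str]:
    """Sorted names of regular files in ckpt_dir; empty list if the directory is missing."""
    if not ckpt_dir.is_dir():
        return []
    return sorted(p.name for p in ckpt_dir.iterdir() if p.is_file())


if __name__ == "__main__":
    # Directory taken from the Model Preparation step; adjust it if ckpt_path
    # in the .yaml files points somewhere else.
    ckpt_dir = Path("LongCat-Audio-Codec/ckpts")
    print("Python 3.10:", python_version_ok())
    files = list_checkpoint_files(ckpt_dir)
    if files:
        print("Checkpoint files found:", ", ".join(files))
    else:
        print(f"No checkpoint files found in {ckpt_dir}")
```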

Highlighted Details

Achieves high fidelity and intelligibility at ultra-low bitrates.
Employs a low-frame-rate tokenizer (16.6Hz) generating semantic and acoustic tokens in parallel.
Features a low-latency streaming detokenizer.
Integrates audio super-resolution for generating higher sample rate audio.

Maintenance & Community

The project page went live on Oct 17, 2025, and the arXiv page followed on Oct 20, 2025. The team can be reached at longcat-team@meituan.com or through a WeChat group.

Licensing & Compatibility

The code and models are released under the MIT License, which permits use, modification, and distribution, including commercial use and linking with closed-source projects. Users are responsible for their own usage, which must comply with applicable laws and avoid harmful content.

Limitations & Caveats

The current version supports only single-channel speech and requires input audio to be less than 30 seconds; longer audio must be manually segmented. The LongCatAudioCodec_decoder_24k_2codebooks.pt model was fine-tuned on a limited speaker dataset, potentially leading to degraded reconstruction quality for speakers not present in the training set.
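The two input constraints above (mono audio, clips under 30 seconds) can be met with a small pre-processing step. The sketch below uses plain Python sequences and hypothetical helper names; it is not the project's code, and it cuts at fixed sample positions, which is only one possible way to segment.

```python
import math
from typing import Sequence


def downmix_to_mono(frames: Sequence[Sequence[float]]) -> list[float]:
    """Average the channels of each frame; frames[i] holds one sample per channel."""
    return [sum(frame) / len(frame) for frame in frames]


def segment_bounds(
    num_samples: int, sample_rate: int, max_seconds: float = 30.0
) -> list[tuple[int, int]]:
    """Return (start, end) sample indices, each span strictly shorter than max_seconds."""
    if num_samples < 0 or sample_rate <= 0 or max_seconds <= 0:
        raise ValueError("num_samples must be >= 0; sample_rate and max_seconds must be > 0")
    # Largest segment length (in samples) that is strictly below the limit.
    max_len = math.ceil(max_seconds * sample_rate) - 1
    if max_len < 1:
        raise ValueError("max_seconds is too short for this sample rate")
    return [
        (start, min(start + max_len, num_samples))
        for start in range(0, num_samples, max_len)
    ]


if __name__ == "__main__":
    # Example: a 70-second clip at 16 kHz (the sample rate here is an assumption).
    sample_rate = 16_000
    for start, end in segment_bounds(70 * sample_rate, sample_rate):
        print(f"samples {start}..{end} ({(end - start) / sample_rate:.3f} s)")
```

Fixed-position cuts can fall in the middle of a word, so splitting at pauses may give better results when the material allows it.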
