meituan-longcat/LongCat-Audio-Codec: Advanced audio tokenization and detokenization for Speech LLMs
This project provides an audio tokenizer and detokenizer designed specifically for speech large language models (LLMs). It enables high-fidelity audio reconstruction at extremely low bitrates, offers a low-latency streaming detokenizer, and integrates audio super-resolution, making it well suited to efficient, high-quality speech processing in LLM backends.
How It Works
The core approach involves generating semantic and acoustic tokens in parallel at a low frame rate of 16.6Hz. This dual-token generation allows for high-intelligibility audio reconstruction even at ultra-low bitrates. The system features a streaming-capable detokenizer that minimizes latency and incorporates audio super-resolution to generate higher-sample-rate audio than the input.
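The sketch below is only a schematic illustration of that data flow, not the project's API: the 16.6 Hz frame rate and the higher-rate decoder output come from the description above, while the 16 kHz input rate, codebook sizes, and chunk size are illustrative assumptions.

```python
# Schematic illustration of the dual-token flow described above.
# The tensors are random stand-ins, NOT the project's real tokenizer outputs.
import torch

SAMPLE_RATE_IN = 16_000      # assumed input rate (single-channel speech)
FRAME_RATE = 16.6            # tokens per second, as stated by the project
SECONDS = 4.0

num_frames = int(SECONDS * FRAME_RATE)          # ~66 frames for 4 s of audio

# 1) Tokenization: semantic and acoustic tokens are produced in parallel,
#    one of each per 16.6 Hz frame (codebook sizes here are placeholders).
semantic_tokens = torch.randint(0, 8192, (num_frames,))
acoustic_tokens = torch.randint(0, 1024, (num_frames, 2))   # e.g. 2 codebooks

# 2) Streaming detokenization: decode a few frames at a time so playback can
#    start before the full token sequence is available.
CHUNK = 8
reconstructed = []
for start in range(0, num_frames, CHUNK):
    sem = semantic_tokens[start:start + CHUNK]
    aco = acoustic_tokens[start:start + CHUNK]
    # A real detokenizer maps these tokens to waveform samples; here we emit
    # silence of the right length just to show the timing relationship.
    samples_per_chunk = int(len(sem) / FRAME_RATE * SAMPLE_RATE_IN)
    reconstructed.append(torch.zeros(samples_per_chunk))

waveform_16k = torch.cat(reconstructed)

# 3) Super-resolution: the decoder can emit audio at a higher sample rate than
#    the input (e.g. the 24 kHz decoder checkpoint mentioned below).
print(f"{num_frames} frames -> {waveform_16k.numel()} samples at {SAMPLE_RATE_IN} Hz")
```

The actual model classes, codebook counts, and sample rates are defined in the repository's .yaml configs; the point here is only the 16.6 Hz framing and the chunked, streaming decode.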
Quick Start & Requirements
- Create a Python 3.10 environment (python=3.10), activate it, and install PyTorch matching your hardware configuration (e.g., pip install torch==2.7.1 torchaudio==2.7.1), followed by the other dependencies (pip install -r requirements.txt).
- Place the model checkpoints in the LongCat-Audio-Codec/ckpts/ directory, or update the ckpt_path in the corresponding .yaml configuration files (see the sketch just after these steps).
- Run bash ./run_inference.sh from the project root for a one-click demonstration, with outputs saved in demo_audio_output/. Customization is possible by modifying the script or running inference.py directly with arguments.
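Since checkpoints can live either in LongCat-Audio-Codec/ckpts/ or wherever ckpt_path points, a small pre-flight check like the one below can catch path mistakes before launching inference. The config filename is hypothetical; substitute the repository's actual .yaml files.

```python
# Pre-flight check: verify that every ckpt_path entry in a .yaml config points
# at a file that actually exists before running inference.
from pathlib import Path
import yaml

CONFIG = Path("configs/detokenizer_24k.yaml")   # hypothetical filename

with CONFIG.open() as f:
    cfg = yaml.safe_load(f)

def find_ckpt_paths(node, prefix=""):
    """Recursively collect every key named 'ckpt_path' in the YAML tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "ckpt_path":
                yield prefix + key, value
            else:
                yield from find_ckpt_paths(value, f"{prefix}{key}.")
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_ckpt_paths(item, f"{prefix}{i}.")

for key, path in find_ckpt_paths(cfg):
    status = "ok" if Path(path).is_file() else "MISSING"
    print(f"{key}: {path} [{status}]")
```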
Maintenance & Community
The project was released on Oct 17, 2025 (project page) and Oct 20, 2025 (arXiv page). Contact is available via longcat-team@meituan.com or a WeChat Group.
Licensing & Compatibility
The code and models are released under the MIT License, granting broad permissions for use, modification, and distribution. It allows commercial use and linking with closed-source projects. Users are accountable for their usage, which must comply with applicable laws and avoid harmful content.
Limitations & Caveats
The current version supports only single-channel speech and requires input audio to be less than 30 seconds; longer audio must be manually segmented. The LongCatAudioCodec_decoder_24k_2codebooks.pt model was fine-tuned on a limited speaker dataset, potentially leading to degraded reconstruction quality for speakers not present in the training set.
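Because inputs must be single-channel and shorter than 30 seconds, longer recordings have to be split before encoding. The repository does not ship a splitter; the sketch below is one simple way to do it with torchaudio (already a dependency), using example file names.

```python
# Split a long recording into mono chunks under 30 s so each piece satisfies
# the codec's input constraints.
import torchaudio

MAX_SECONDS = 30.0

def split_for_codec(in_path: str, out_prefix: str) -> list[str]:
    waveform, sample_rate = torchaudio.load(in_path)

    # Downmix to a single channel, since only mono speech is supported.
    if waveform.size(0) > 1:
        waveform = waveform.mean(dim=0, keepdim=True)

    chunk_samples = int(MAX_SECONDS * sample_rate)
    out_paths = []
    for i, start in enumerate(range(0, waveform.size(1), chunk_samples)):
        chunk = waveform[:, start:start + chunk_samples]
        out_path = f"{out_prefix}_{i:03d}.wav"
        torchaudio.save(out_path, chunk, sample_rate)
        out_paths.append(out_path)
    return out_paths

# Example: split_for_codec("long_interview.wav", "segments/part")
```

Note that a fixed-length split can cut words in half; splitting at silence boundaries instead will generally give better reconstructions.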