xcodec by zhenye234

Unified semantic and acoustic codec for advanced audio language modeling

Created 1 year ago
259 stars

Top 98.0% on SourcePulse

View on GitHub
Project Summary

The zhenye234/xcodec project introduces a unified semantic and acoustic codec designed to address the semantic shortcomings of traditional audio codecs in audio language models. It offers a novel approach to enhance existing audio processing pipelines, benefiting researchers and practitioners in speech synthesis, audio generation, and natural language processing for audio.

How It Works

X-Codec integrates semantic features, typically extracted from self-supervised models such as HuBERT or WavLM, directly into the audio codec pipeline. The architecture combines acoustic encoder outputs with semantic model embeddings, passes them through projector layers, and then quantizes the unified representation using Residual Vector Quantization (RVQ). This lets the codec capture richer semantic information, improving both audio quality and downstream language-model performance. The approach is implemented in a `Codec` class that brings together the acoustic and semantic modules, quantization, and decoding steps; a simplified sketch follows.
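
A minimal PyTorch sketch of the fuse-then-quantize idea described above; the class name, dimensions, and placeholder quantizer are illustrative assumptions and do not mirror the repository's actual `Codec` API:

```python
import torch
import torch.nn as nn


class XCodecSketch(nn.Module):
    """Illustrative sketch: fuse acoustic and semantic features, then quantize."""

    def __init__(self, acoustic_dim=256, semantic_dim=768, unified_dim=512):
        super().__init__()
        # Acoustic encoder (stand-in for a DAC-style convolutional encoder).
        self.acoustic_encoder = nn.Conv1d(1, acoustic_dim, kernel_size=7, stride=4, padding=3)
        # Stand-in for a frozen semantic model such as HuBERT or WavLM.
        self.semantic_model = nn.Conv1d(1, semantic_dim, kernel_size=7, stride=4, padding=3)
        # Projector layers map both streams into a shared space before quantization.
        self.acoustic_proj = nn.Linear(acoustic_dim, unified_dim)
        self.semantic_proj = nn.Linear(semantic_dim, unified_dim)
        # Placeholder for Residual Vector Quantization (RVQ) of the fused representation.
        self.quantizer = nn.Identity()
        # Decoder maps the quantized representation back to a waveform.
        self.decoder = nn.ConvTranspose1d(unified_dim, 1, kernel_size=7, stride=4,
                                          padding=3, output_padding=3)

    def forward(self, wav):                       # wav: (batch, 1, samples)
        acoustic = self.acoustic_encoder(wav)     # (batch, acoustic_dim, frames)
        semantic = self.semantic_model(wav)       # (batch, semantic_dim, frames)
        fused = (self.acoustic_proj(acoustic.transpose(1, 2))
                 + self.semantic_proj(semantic.transpose(1, 2)))
        codes = self.quantizer(fused)             # RVQ would discretize here
        return self.decoder(codes.transpose(1, 2))


# Example: round-trip one second of dummy audio at 16 kHz.
if __name__ == "__main__":
    model = XCodecSketch()
    wav = torch.randn(1, 1, 16000)
    print(model(wav).shape)  # torch.Size([1, 1, 16000])
```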

Quick Start & Requirements

  • Inference: Download the models and configuration from Hugging Face, then run `python inference.py`.
  • Training: Prepare training and validation files listing audio file paths (an illustrative layout is sketched after this list), then launch training with `torchrun --nnodes=1 --nproc-per-node=8 main_launch_vqdp.py`.
  • Prerequisites: PyTorch, Hugging Face Transformers, and the relevant pre-trained semantic models (e.g., HuBERT, WavLM). The codebase is noted as being largely borrowed from UniAudio and DAC.
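
For illustration, the training and validation lists might be plain text files with one audio path per line; this layout and the paths below are assumptions, not a format confirmed by the README:

```
/data/audio/train/utt_0001.wav
/data/audio/train/utt_0002.wav
/data/audio/train/utt_0003.wav
```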

Highlighted Details

  • The project is associated with the AAAI 2025 paper "Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model".
  • Multiple pre-trained models are available on Hugging Face, including variants for speech and general audio domains, utilizing different semantic backbones and training datasets.
  • Experiments highlight the approach's ability to enhance existing models, such as VALL-E.

Maintenance & Community

The provided README does not contain information regarding project maintainers, community channels (e.g., Discord, Slack), or a public roadmap.

Licensing & Compatibility

No specific license information is provided in the README. Consequently, details regarding commercial use or compatibility with closed-source projects are not available.

Limitations & Caveats

Some of the listed pre-trained models are explicitly marked as "not mentioned in paper," indicating they may be experimental variations or extensions beyond the core publication. The heavy reliance on external codebases (UniAudio, DAC) may introduce implicit dependencies or licensing considerations.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 7 stars in the last 30 days
