xcodec by zhenye234

Unified semantic and acoustic codec for advanced audio language modeling

Created 1 year ago
259 stars

Top 98.0% on SourcePulse

View on GitHub
Project Summary

The zhenye234/xcodec project introduces a unified semantic and acoustic codec designed to address the semantic shortcomings of traditional audio codecs in audio language models. It offers a novel approach to enhance existing audio processing pipelines, benefiting researchers and practitioners in speech synthesis, audio generation, and natural language processing for audio.

How It Works

X-Codec integrates semantic features, typically extracted from self-supervised models such as HuBERT or WavLM, directly into the audio codec pipeline. The architecture combines acoustic encoder outputs with semantic model embeddings, passes them through projector layers, and then quantizes the unified representation using Residual Vector Quantization (RVQ). This lets the codec capture richer semantic information, improving both audio quality and downstream language-model performance. The approach is implemented in a `Codec` class that brings together the acoustic and semantic modules, quantization, and decoding steps; a simplified sketch follows.
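
A minimal PyTorch sketch of the fuse-then-quantize idea described above; the class name, dimensions, and placeholder quantizer are illustrative assumptions and do not mirror the repository's actual `Codec` API:

```python
import torch
import torch.nn as nn


class XCodecSketch(nn.Module):
    """Illustrative sketch: fuse acoustic and semantic features, then quantize."""

    def __init__(self, acoustic_dim=256, semantic_dim=768, unified_dim=512):
        super().__init__()
        # Acoustic encoder (stand-in for a DAC-style convolutional encoder).
        self.acoustic_encoder = nn.Conv1d(1, acoustic_dim, kernel_size=7, stride=4, padding=3)
        # Stand-in for a frozen semantic model such as HuBERT or WavLM.
        self.semantic_model = nn.Conv1d(1, semantic_dim, kernel_size=7, stride=4, padding=3)
        # Projector layers map both streams into a shared space before quantization.
        self.acoustic_proj = nn.Linear(acoustic_dim, unified_dim)
        self.semantic_proj = nn.Linear(semantic_dim, unified_dim)
        # Placeholder for Residual Vector Quantization (RVQ) of the fused representation.
        self.quantizer = nn.Identity()
        # Decoder maps the quantized representation back to a waveform.
        self.decoder = nn.ConvTranspose1d(unified_dim, 1, kernel_size=7, stride=4,
                                          padding=3, output_padding=3)

    def forward(self, wav):                       # wav: (batch, 1, samples)
        acoustic = self.acoustic_encoder(wav)     # (batch, acoustic_dim, frames)
        semantic = self.semantic_model(wav)       # (batch, semantic_dim, frames)
        fused = (self.acoustic_proj(acoustic.transpose(1, 2))
                 + self.semantic_proj(semantic.transpose(1, 2)))
        codes = self.quantizer(fused)             # RVQ would discretize here
        return self.decoder(codes.transpose(1, 2))


# Example: round-trip one second of dummy audio at 16 kHz.
if __name__ == "__main__":
    model = XCodecSketch()
    wav = torch.randn(1, 1, 16000)
    print(model(wav).shape)  # torch.Size([1, 1, 16000])
```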

Quick Start & Requirements

  • Inference: Download the models and configuration from Hugging Face, then run `python inference.py`.
  • Training: Prepare training and validation files listing audio file paths (an illustrative layout is sketched after this list), then launch training with `torchrun --nnodes=1 --nproc-per-node=8 main_launch_vqdp.py`.
  • Prerequisites: PyTorch, Hugging Face Transformers, and the relevant pre-trained semantic models (e.g., HuBERT, WavLM). The codebase is noted as being largely borrowed from UniAudio and DAC.
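
For illustration, the training and validation lists might be plain text files with one audio path per line; this layout and the paths below are assumptions, not a format confirmed by the README:

```
/data/audio/train/utt_0001.wav
/data/audio/train/utt_0002.wav
/data/audio/train/utt_0003.wav
```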

Highlighted Details

  • The project is associated with the AAAI 2025 paper "Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model".
  • Multiple pre-trained models are available on Hugging Face, including variants for speech and general audio domains, utilizing different semantic backbones and training datasets.
  • Experiments highlight the approach's ability to enhance existing models, such as VALL-E.

Maintenance & Community

The provided README does not contain information regarding project maintainers, community channels (e.g., Discord, Slack), or a public roadmap.

Licensing & Compatibility

No specific license information is provided in the README. Consequently, details regarding commercial use or compatibility with closed-source projects are not available.

Limitations & Caveats

Some of the listed pre-trained models are explicitly marked as "not mentioned in paper," indicating they may be experimental variations or extensions beyond the core publication. The heavy reliance on external codebases (UniAudio, DAC) may introduce implicit dependencies or licensing considerations.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 7 stars in the last 30 days
