WavTokenizer by jishengpeng

Research paper for discrete acoustic codec models

created 11 months ago
1,171 stars

Top 33.9% on sourcepulse

View on GitHub
Project Summary

WavTokenizer provides state-of-the-art discrete acoustic codec models for audio language modeling, enabling efficient representation of speech and music with low token rates. It is designed for researchers and developers working on advanced audio processing and generative AI models, offering strong reconstruction capabilities and rich semantic information.

How It Works

WavTokenizer employs a neural network architecture to quantize audio waveforms into discrete tokens. It achieves high compression rates by learning a compact representation of audio signals, allowing for efficient processing and generation. The models are trained on diverse datasets, enabling them to capture nuances across speech, audio, and music domains.
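The quantization idea can be illustrated with a minimal vector-quantization sketch. This is a toy NumPy example, not WavTokenizer's actual architecture: each frame embedding from a hypothetical encoder is mapped to the index of its nearest codebook vector, and that index is the discrete token.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "codebook": 8 code vectors of dimension 4 (learned in a real model).
codebook = rng.normal(size=(8, 4))

# Toy "encoder output": 6 frame embeddings for a short audio clip.
frames = rng.normal(size=(6, 4))

def quantize(frames, codebook):
    """Map each frame to the index of its nearest codebook entry."""
    # Pairwise squared distances, shape (n_frames, n_codes).
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)  # one discrete token per frame

tokens = quantize(frames, codebook)          # e.g. 6 integers in [0, 8)
reconstruction = codebook[tokens]            # what a decoder would receive
```

A real codec adds an encoder, a decoder, and training losses around this lookup, but the token stream it produces has the same form: a short sequence of integers per second of audio.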

Quick Start & Requirements

  • Install via conda create -n wavtokenizer python=3.9 and conda activate wavtokenizer, then pip install -r requirements.txt.
  • Requires PyTorch and torchaudio.
  • Official HuggingFace model hub links are provided for various checkpoints.
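The setup steps above, collected into one shell snippet. The commands come from the bullets; the clone URL and the location of requirements.txt at the repo root are assumptions.

```shell
# Clone the repository (URL assumed from the project's GitHub page)
git clone https://github.com/jishengpeng/WavTokenizer.git
cd WavTokenizer

# Create and activate an isolated environment
conda create -n wavtokenizer python=3.9
conda activate wavtokenizer

# Install Python dependencies (PyTorch and torchaudio among them)
pip install -r requirements.txt
```

Model checkpoints are downloaded separately from the HuggingFace links in the README.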

Highlighted Details

  • Represents audio with as few as 40 or 75 tokens per second.
  • Offers strong audio reconstruction results.
  • Models are available for speech, audio, and music domains.
  • Supports training from scratch with provided configuration and training scripts.
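To put those token rates in perspective, a quick back-of-the-envelope calculation (assuming 24 kHz input audio) shows how many raw samples each discrete token stands in for:

```python
sample_rate = 24_000  # Hz; assumed input rate, not stated in this summary

for tokens_per_sec in (40, 75):
    samples_per_token = sample_rate / tokens_per_sec
    print(f"{tokens_per_sec} tok/s -> one token per {samples_per_token:.0f} samples")
    # 40 tok/s -> one token per 600 samples
    # 75 tok/s -> one token per 320 samples
```

In other words, each token summarizes hundreds of waveform samples, which is what makes the representation practical for audio language models.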

Maintenance & Community

The project has released multiple checkpoints and updates, including camera-ready versions for ICLR 2025 and arXiv preprints. Links to HuggingFace for models are available.

Licensing & Compatibility

The README's model table marks all listed models as open-source, but no explicit license is stated there. Users should verify the actual license terms before relying on the models in research or commercial applications.

Limitations & Caveats

The README references placeholder paths such as configs/xxx.yaml and xxx.ckpt, so users must supply their own config and checkpoint files; these are not bundled in the main repository structure. The training process defers to PyTorch Lightning documentation for customization, which implies a learning curve for custom training setups.

Health Check
Last commit

5 months ago

Responsiveness

1 day

Pull Requests (30d)
0
Issues (30d)
0
Star History
57 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems) and Jeff Hammerbacher (cofounder of Cloudera).

AudioGPT by AIGC-Audio

Top 0.1% on sourcepulse
10k stars
Audio processing and generation research project
created 2 years ago
updated 1 year ago