audiolm-pytorch by lucidrains

PyTorch implementation of Google's AudioLM for audio generation

created 2 years ago
2,569 stars

Top 18.7% on sourcepulse

Project Summary

This repository provides a PyTorch implementation of Google's AudioLM, a language modeling approach to audio generation. It targets researchers and developers interested in replicating state-of-the-art audio synthesis, with extensions for text-to-audio and text-to-speech (TTS) capabilities via classifier-free guidance.

How It Works

The implementation follows AudioLM's hierarchical structure, comprising three transformers: Semantic, Coarse, and Fine. Semantic tokens are derived from HuBERT with k-means clustering, while acoustic tokens come from a neural audio codec: either the pre-trained EnCodec (MIT licensed) or a trainable SoundStream component. SoundStream employs residual vector quantization and local attention for efficient audio tokenization. The system can also be conditioned on text through a modified Semantic Transformer, enabling text-driven audio generation.
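A minimal sketch of how these pieces are assembled, assuming the import paths and constructor arguments roughly follow the repository README (exact argument names and defaults may differ between versions):

    from audiolm_pytorch import (
        SoundStream, HubertWithKmeans,
        SemanticTransformer, CoarseTransformer, FineTransformer, AudioLM
    )

    # Neural codec producing coarse and fine acoustic tokens via residual VQ
    soundstream = SoundStream(
        codebook_size = 4096,
        rq_num_quantizers = 8
    )

    # HuBERT + k-means supplies the semantic tokens (checkpoint paths are placeholders)
    wav2vec = HubertWithKmeans(
        checkpoint_path = './hubert/hubert_base_ls960.pt',
        kmeans_path = './hubert/hubert_base_ls960_L9_km500_model.bin'
    )

    # The three hierarchical transformers of AudioLM
    semantic_transformer = SemanticTransformer(
        num_semantic_tokens = wav2vec.codebook_size, dim = 1024, depth = 6, flash_attn = True
    )

    coarse_transformer = CoarseTransformer(
        num_semantic_tokens = wav2vec.codebook_size, codebook_size = 1024,
        num_coarse_quantizers = 3, dim = 512, depth = 6
    )

    fine_transformer = FineTransformer(
        num_coarse_quantizers = 3, num_fine_quantizers = 5,
        codebook_size = 1024, dim = 512, depth = 6
    )

    # Assemble the full pipeline and sample a waveform unconditionally
    audiolm = AudioLM(
        wav2vec = wav2vec,
        codec = soundstream,
        semantic_transformer = semantic_transformer,
        coarse_transformer = coarse_transformer,
        fine_transformer = fine_transformer
    )

    generated_wav = audiolm(batch_size = 1)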

Quick Start & Requirements

  • Install via pip: pip install audiolm-pytorch
  • Requires PyTorch.
  • For training SoundStream, a large corpus of audio files is needed (see the training sketch after this list).
  • To use the HuBERT-based semantic tokens, download the checkpoints from the Fairseq repository.
  • Official documentation and examples are available within the repository.
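
As a quick-start illustration, here is a sketch of training SoundStream on a folder of audio files, assuming the SoundStreamTrainer API follows the README (the folder path and hyperparameters are placeholders):

    from audiolm_pytorch import SoundStream, SoundStreamTrainer

    soundstream = SoundStream(
        codebook_size = 4096,
        rq_num_quantizers = 8
    )

    trainer = SoundStreamTrainer(
        soundstream,
        folder = '/path/to/audio/files',   # placeholder: directory of training audio
        batch_size = 4,
        grad_accum_every = 8,              # effective batch size of 32
        data_max_length_seconds = 2,       # train on 2-second crops
        num_train_steps = 1_000_000
    ).cuda()

    trainer.train()

Alternatively, the pre-trained EnCodec can be used in place of a trained SoundStream, which avoids this training step entirely.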

Highlighted Details

  • Extends AudioLM with text conditioning and classifier-free guidance, enabling TTS (see the sketch after this list).
  • Includes an MIT-licensed implementation of SoundStream; the pre-trained EnCodec can be used as a drop-in alternative codec.
  • Supports Flash Attention for improved transformer performance.
  • Offers configurable residual vector quantization variants within SoundStream.
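
A sketch of the text-conditioning extension, assuming the conditioning flags exposed in the README (argument names may vary between versions):

    from audiolm_pytorch import SemanticTransformer

    # Build the semantic transformer with text conditioning enabled; the coarse
    # and fine transformers accept the same flag. Classifier-free guidance is
    # applied at sampling time when a text prompt is supplied.
    semantic_transformer = SemanticTransformer(
        num_semantic_tokens = 500,          # placeholder: should match the semantic codebook size
        dim = 1024,
        depth = 6,
        has_condition = True,               # enable text conditioning
        cond_as_self_attn_prefix = True     # condition as a self-attention prefix instead of cross-attention
    )

    # After assembling AudioLM as above, pass a text prompt at generation time:
    # generated_wav = audiolm(text = ['chirping of birds and the distant echos of bells'])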

Maintenance & Community

The project is actively maintained by lucidrains and has received contributions from numerous individuals credited in the README, with sponsorship from Stability.ai. No dedicated community discussion channels are mentioned, though the project builds on Hugging Face libraries.

Licensing & Compatibility

The core audiolm-pytorch library appears to be MIT licensed, as are the integrated EnCodec and SoundStream components. Compatibility with commercial or closed-source projects should still be verified against the specific licenses of all integrated components and pre-trained checkpoints.

Limitations & Caveats

Training SoundStream requires significant computational resources and a large audio dataset. The implementation of text conditioning is an extension beyond the original AudioLM paper, and its performance may vary. Some advanced features like key/value caching and structured dropout are listed as "to-do".

Health Check

  • Last commit: 6 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 42 stars in the last 90 days
