PyTorch implementation of Google's AudioLM for audio generation
This repository provides a PyTorch implementation of Google's AudioLM, a language modeling approach to audio generation. It targets researchers and developers interested in replicating state-of-the-art audio synthesis, with extensions for text-to-audio and text-to-speech (TTS) capabilities via classifier-free guidance.
How It Works
The implementation follows AudioLM's hierarchical structure, comprising three transformers: Semantic, Coarse, and Fine. For acoustic tokenization it relies on a neural audio codec, offering a choice between a pre-trained EnCodec (MIT licensed) and a trainable SoundStream implementation. SoundStream employs residual vector quantization and local attention for efficient audio tokenization. The system can be conditioned on text using a modified Semantic Transformer, enabling text-driven audio generation.
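As a concrete sketch of the codec layer, both options can be instantiated as below. Parameter names follow the project's README at the time of writing and may differ between versions; the values shown are illustrative, not recommended settings.

```python
from audiolm_pytorch import SoundStream, EncodecWrapper

# Option 1: trainable SoundStream codec, with residual vector quantization
# (rq_num_quantizers) and local attention (attn_window_size, attn_depth)
soundstream = SoundStream(
    codebook_size = 1024,      # entries per quantizer codebook
    rq_num_quantizers = 8,     # depth of the residual VQ stack
    attn_window_size = 128,    # local attention window over the latent sequence
    attn_depth = 2             # number of local attention layers
)

# Option 2: pre-trained EnCodec (MIT licensed), used as a drop-in codec
encodec = EncodecWrapper()
```

Either codec maps raw waveforms to discrete acoustic tokens for the Coarse and Fine transformers to model.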
Quick Start & Requirements
```bash
pip install audiolm-pytorch
```
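Below is a condensed end-to-end sketch based on the project's README (names and signatures may differ between versions). The HuBERT checkpoint paths are placeholders that must be downloaded separately, and each transformer needs to be trained with its corresponding trainer class before generation yields meaningful audio.

```python
from audiolm_pytorch import (
    HubertWithKmeans, SoundStream,
    SemanticTransformer, CoarseTransformer, FineTransformer,
    AudioLM
)

# Semantic tokenizer: pre-trained HuBERT plus k-means clustering
# (checkpoint paths are placeholders; download them separately)
wav2vec = HubertWithKmeans(
    checkpoint_path = './hubert/hubert_base_ls960.pt',
    kmeans_path = './hubert/hubert_base_ls960_L9_km500.bin'
)

soundstream = SoundStream(
    codebook_size = 1024,
    rq_num_quantizers = 8
)

semantic_transformer = SemanticTransformer(
    num_semantic_tokens = wav2vec.codebook_size,
    dim = 1024,
    depth = 6
)

coarse_transformer = CoarseTransformer(
    num_semantic_tokens = wav2vec.codebook_size,
    codebook_size = 1024,
    num_coarse_quantizers = 3,
    dim = 512,
    depth = 6
)

fine_transformer = FineTransformer(
    num_coarse_quantizers = 3,
    num_fine_quantizers = 5,
    codebook_size = 1024,
    dim = 512,
    depth = 6
)

# AudioLM chains all three stages; each transformer is trained separately
# beforehand with SemanticTransformerTrainer, CoarseTransformerTrainer,
# and FineTransformerTrainer
audiolm = AudioLM(
    wav2vec = wav2vec,
    codec = soundstream,
    semantic_transformer = semantic_transformer,
    coarse_transformer = coarse_transformer,
    fine_transformer = fine_transformer
)

# unconditional generation
generated_wav = audiolm(batch_size = 1)

# text-conditioned generation (classifier-free guidance), assuming the
# transformers were built and trained with conditioning enabled
generated_with_text = audiolm(text = ['chirping of birds'])
```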
Highlighted Details
- Hierarchical three-transformer pipeline (Semantic → Coarse → Fine), mirroring the AudioLM paper
- Choice of neural codec: pre-trained EnCodec or trainable SoundStream with residual vector quantization and local attention
- Text-to-audio and TTS extensions via classifier-free guidance
- Training utilities built on Hugging Face libraries
Maintenance & Community
The project is maintained by lucidrains, with contributions from numerous individuals credited in the README, and is sponsored by Stability.ai. No community discussion channels are explicitly mentioned, though the project leverages Hugging Face libraries.
Licensing & Compatibility
The core audiolm-pytorch library appears to be MIT licensed, and the integrated EnCodec and SoundStream components are MIT licensed as well. Even so, compatibility with commercial or closed-source projects should be verified against the specific licenses of all integrated components.
Limitations & Caveats
Training SoundStream requires significant computational resources and a large audio dataset. The implementation of text conditioning is an extension beyond the original AudioLM paper, and its performance may vary. Some advanced features like key/value caching and structured dropout are listed as "to-do".
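For a sense of the training workflow implied above, here is a minimal SoundStream training sketch adapted from the README; the folder path and hyperparameters are placeholders, and argument names may vary across versions.

```python
from audiolm_pytorch import SoundStream, SoundStreamTrainer

soundstream = SoundStream(
    codebook_size = 1024,
    rq_num_quantizers = 8
)

# the trainer consumes a folder of raw audio files; expect this to demand
# a large dataset and substantial GPU time for good reconstructions
trainer = SoundStreamTrainer(
    soundstream,
    folder = '/path/to/audio/files',
    batch_size = 4,
    grad_accum_every = 8,          # effective batch size of 32
    data_max_length_seconds = 2,   # crop clips to 2 seconds
    num_train_steps = 1_000_000
).cuda()

trainer.train()
```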