PyTorch implementation of Google's AudioLM for audio generation
This repository provides a PyTorch implementation of Google's AudioLM, a language modeling approach to audio generation. It targets researchers and developers interested in replicating state-of-the-art audio synthesis, with extensions for text-to-audio and text-to-speech (TTS) capabilities via classifier-free guidance.
How It Works
The implementation follows AudioLM's hierarchical structure, comprising three transformers: Semantic, Coarse, and Fine. For acoustic tokenization it relies on a neural audio codec, offering a choice between a pre-trained EnCodec (MIT licensed) and a trainable SoundStream implementation. SoundStream employs residual vector quantization and local attention for efficient audio tokenization. The system can be conditioned on text using a modified Semantic Transformer, enabling text-driven audio generation.
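As a concrete sketch of the codec layer, both options can be instantiated as below. Parameter names follow the project's README at the time of writing and may differ between versions; the values shown are illustrative, not recommended settings.

```python
from audiolm_pytorch import SoundStream, EncodecWrapper

# Option 1: trainable SoundStream codec, with residual vector quantization
# (rq_num_quantizers) and local attention (attn_window_size, attn_depth)
soundstream = SoundStream(
    codebook_size = 1024,      # entries per quantizer codebook
    rq_num_quantizers = 8,     # depth of the residual VQ stack
    attn_window_size = 128,    # local attention window over the latent sequence
    attn_depth = 2             # number of local attention layers
)

# Option 2: pre-trained EnCodec (MIT licensed), used as a drop-in codec
encodec = EncodecWrapper()
```

Either codec maps raw waveforms to discrete acoustic tokens for the Coarse and Fine transformers to model.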
Quick Start & Requirements
```bash
pip install audiolm-pytorch
```
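Below is a condensed end-to-end sketch based on the project's README (names and signatures may differ between versions). The HuBERT checkpoint paths are placeholders that must be downloaded separately, and each transformer needs to be trained with its corresponding trainer class before generation yields meaningful audio.

```python
from audiolm_pytorch import (
    HubertWithKmeans, SoundStream,
    SemanticTransformer, CoarseTransformer, FineTransformer,
    AudioLM
)

# Semantic tokenizer: pre-trained HuBERT plus k-means clustering
# (checkpoint paths are placeholders; download them separately)
wav2vec = HubertWithKmeans(
    checkpoint_path = './hubert/hubert_base_ls960.pt',
    kmeans_path = './hubert/hubert_base_ls960_L9_km500.bin'
)

soundstream = SoundStream(
    codebook_size = 1024,
    rq_num_quantizers = 8
)

semantic_transformer = SemanticTransformer(
    num_semantic_tokens = wav2vec.codebook_size,
    dim = 1024,
    depth = 6
)

coarse_transformer = CoarseTransformer(
    num_semantic_tokens = wav2vec.codebook_size,
    codebook_size = 1024,
    num_coarse_quantizers = 3,
    dim = 512,
    depth = 6
)

fine_transformer = FineTransformer(
    num_coarse_quantizers = 3,
    num_fine_quantizers = 5,
    codebook_size = 1024,
    dim = 512,
    depth = 6
)

# AudioLM chains all three stages; each transformer is trained separately
# beforehand with SemanticTransformerTrainer, CoarseTransformerTrainer,
# and FineTransformerTrainer
audiolm = AudioLM(
    wav2vec = wav2vec,
    codec = soundstream,
    semantic_transformer = semantic_transformer,
    coarse_transformer = coarse_transformer,
    fine_transformer = fine_transformer
)

# unconditional generation
generated_wav = audiolm(batch_size = 1)

# text-conditioned generation (classifier-free guidance), assuming the
# transformers were built and trained with conditioning enabled
generated_with_text = audiolm(text = ['chirping of birds'])
```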
Highlighted Details
- Hierarchical three-transformer pipeline (Semantic → Coarse → Fine), mirroring the AudioLM paper
- Choice of neural codec: pre-trained EnCodec or trainable SoundStream with residual vector quantization and local attention
- Text-to-audio and TTS extensions via classifier-free guidance
- Training utilities built on Hugging Face libraries
Maintenance & Community
The project is maintained by lucidrains, with contributions from numerous individuals credited in the README, and is sponsored by Stability.ai. No community discussion channels are explicitly mentioned, though the project leverages Hugging Face libraries.
Licensing & Compatibility
The core audiolm-pytorch library appears to be MIT licensed, and the integrated EnCodec and SoundStream components are MIT licensed as well. Even so, compatibility with commercial or closed-source projects should be verified against the specific licenses of all integrated components.
Limitations & Caveats
Training SoundStream requires significant computational resources and a large audio dataset. The implementation of text conditioning is an extension beyond the original AudioLM paper, and its performance may vary. Some advanced features like key/value caching and structured dropout are listed as "to-do".
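For a sense of the training workflow implied above, here is a minimal SoundStream training sketch adapted from the README; the folder path and hyperparameters are placeholders, and argument names may vary across versions.

```python
from audiolm_pytorch import SoundStream, SoundStreamTrainer

soundstream = SoundStream(
    codebook_size = 1024,
    rq_num_quantizers = 8
)

# the trainer consumes a folder of raw audio files; expect this to demand
# a large dataset and substantial GPU time for good reconstructions
trainer = SoundStreamTrainer(
    soundstream,
    folder = '/path/to/audio/files',
    batch_size = 4,
    grad_accum_every = 8,          # effective batch size of 32
    data_max_length_seconds = 2,   # crop clips to 2 seconds
    num_train_steps = 1_000_000
).cuda()

trainer.train()
```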