PyTorch implementation of SoundStorm for efficient parallel audio generation
This repository provides a PyTorch implementation of Google DeepMind's SoundStorm, an efficient parallel audio generation model. It is aimed at researchers and developers working on speech synthesis and audio generation who need fast, high-quality audio output.
How It Works
SoundStorm leverages a Masked Generative Transformer (inspired by MaskGIT) applied to residual vector quantized (RVQ) codes generated by the SoundStream model. It uses a Conformer architecture, well-suited for audio, to predict these quantized codes in parallel across multiple stages. This approach allows for efficient, high-fidelity audio generation by modeling the discrete audio tokens.
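The MaskGIT-style parallel decoding described above can be sketched roughly as follows. This is a minimal illustration, not the repository's API: `toy_predict` is a hypothetical stand-in for the Conformer that returns random guesses, the `MASK` sentinel and function names are assumptions made for the example, and only a single quantizer level is decoded, whereas SoundStorm repeats the procedure across RVQ levels. A cosine schedule decides how many positions stay masked after each step, and the most confident predictions are committed first.

```python
import math
import random

MASK = -1  # hypothetical sentinel for a position that is not yet decoded


def cosine_mask_count(step, total_steps, seq_len):
    """How many positions remain masked after `step` of `total_steps`."""
    ratio = math.cos(math.pi / 2 * step / total_steps)
    return max(0, int(math.floor(ratio * seq_len)))


def toy_predict(tokens, codebook_size, rng):
    """Stand-in for the Conformer: one (token, confidence) guess per position."""
    return [(rng.randrange(codebook_size), rng.random()) for _ in tokens]


def parallel_decode(seq_len=16, codebook_size=1024, total_steps=4, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    masked_history = []
    for step in range(1, total_steps + 1):
        # Predict every position at once, then decide which guesses to keep.
        preds = toy_predict(tokens, codebook_size, rng)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        keep_masked = cosine_mask_count(step, total_steps, seq_len)
        # Commit the most confident predictions; the rest stay masked.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[: len(masked) - keep_masked]:
            tokens[i] = preds[i][0]
        masked_history.append(keep_masked)
    return tokens, masked_history


tokens, masked_history = parallel_decode()
print(masked_history)
```

With 16 positions and 4 steps, only a few tokens are committed early and most are filled in during the later steps, so the whole sequence is produced in a handful of forward passes rather than one pass per token.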
Quick Start & Requirements
pip install soundstorm-pytorch
Maintenance & Community
The project acknowledges contributions from Lucas Newman, Steven Hillis, and Jiang-Stan. It is sponsored by Hugging Face and Stability AI.
Licensing & Compatibility
The repository does not explicitly state a license in the provided README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The text-to-speech functionality is a work in progress: the TextToSemantic model architecture is complete, but pretraining and pseudo-labeling logic are still missing. Some features, such as returning audio files as a list and a command-line tool, remain on the to-do list.