soundstorm-pytorch by lucidrains

PyTorch implementation of SoundStorm for efficient parallel audio generation

Created 2 years ago

1,545 stars

Top 26.7% on SourcePulse

2 Experts Love This Project

Shawn Wang

Editor of Latent Space

Sam Partee

Cofounder of Arcade

Project Summary

This repository provides a PyTorch implementation of Google DeepMind's SoundStorm, an efficient parallel audio generation model. It is aimed at researchers and developers working on speech synthesis and audio generation who need fast, high-quality output.

How It Works

SoundStorm applies a masked generative transformer, inspired by MaskGIT, to the residual vector quantized (RVQ) codes produced by the SoundStream model. A Conformer, an architecture well suited to audio, serves as the backbone and predicts the masked codes in parallel over several decoding stages. Because the model operates on discrete audio tokens and decodes them in parallel rather than one at a time, generation is efficient while retaining high fidelity.
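To make the idea of residual vector quantized codes concrete, here is a minimal toy sketch in plain Python. It is not SoundStream or this repository's code: the scalar codebooks, the value 0.62, and the function names (`rvq_encode`, `rvq_decode`, `nearest`) are all made up for illustration. Real codecs quantize learned vectors, not single numbers, but the principle is the same: each stage quantizes whatever the previous stages left over.

```python
from typing import List, Sequence


def nearest(codebook: Sequence[float], x: float) -> int:
    """Return the index of the codebook entry closest to x."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - x))


def rvq_encode(x: float, codebooks: Sequence[Sequence[float]]) -> List[int]:
    """Quantize x stage by stage; each stage encodes the remaining residual."""
    codes = []
    residual = x
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual -= codebook[idx]  # pass what is left to the next stage
    return codes


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Sequence[float]]) -> float:
    """Reconstruct the value by summing the selected entry from every stage."""
    return sum(codebook[i] for codebook, i in zip(codebooks, codes))


# Hypothetical toy codebooks: one coarse stage followed by two finer ones.
CODEBOOKS = [
    [-1.0, 0.0, 1.0],
    [-0.3, 0.0, 0.3],
    [-0.1, 0.0, 0.1],
]

codes = rvq_encode(0.62, CODEBOOKS)     # one integer code per stage
approx = rvq_decode(codes, CODEBOOKS)   # coarse-to-fine reconstruction
```

In this toy setting, one input value becomes one small integer per quantizer stage. For audio, that yields a grid of tokens (time steps by quantizer stages), and those are the tokens a masked model such as SoundStorm is trained to predict.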

Quick Start & Requirements

Install: pip install soundstorm-pytorch
Prerequisites: PyTorch, Accelerate, Einops. GPU with CUDA is recommended for training and generation.
Usage examples and integration with SoundStream and Text-to-Semantic models are provided in the README.

Highlighted Details

Implements SoundStorm, an efficient parallel audio generation model.
Integrates with SoundStream for audio tokenization and optionally a Text-to-Semantic model for text-to-speech.
Utilizes a Conformer architecture with rotary positional embeddings.
Supports generation in a configurable number of steps (e.g., 18), using a cosine schedule to control how many tokens are unmasked at each step.
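The step count and cosine schedule in the last item can be illustrated with a short sketch, assuming the schedule works as in MaskGIT, where the fraction of tokens still masked follows a quarter cosine wave. The function names and the sequence length of 1024 are hypothetical, and the repository's actual implementation may differ in its details.

```python
import math
from typing import List


def cosine_mask_ratio(progress: float) -> float:
    """Fraction of tokens still masked at a given progress in [0, 1]."""
    return math.cos(progress * math.pi / 2)


def masked_counts(seq_len: int, steps: int) -> List[int]:
    """Number of tokens still masked after each decoding step."""
    counts = []
    for step in range(1, steps + 1):
        ratio = cosine_mask_ratio(step / steps)
        counts.append(int(seq_len * ratio))
    counts[-1] = 0  # the final step must leave nothing masked
    return counts


# Assumed example values: 1024 tokens decoded in 18 steps.
schedule = masked_counts(seq_len=1024, steps=18)

# Tokens committed at each step: few at first, many toward the end.
revealed_per_step = [
    prev - cur for prev, cur in zip([1024] + schedule[:-1], schedule)
]
```

Under this assumption, the first steps commit only a handful of tokens, while the context is still sparse, and later steps commit progressively more as the predictions become better conditioned.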

Maintenance & Community

The project acknowledges contributions from Lucas Newman, Steven Hillis, and Jiang-Stan. It is sponsored by Hugging Face and Stability AI.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The text-to-speech functionality is a work in progress: the TextToSemantic model architecture is complete, but the pretraining and pseudo-labeling logic is still missing. Some features, such as returning audio files as a list and a command-line tool, remain on the to-do list.

Health Check

Last Commit

8 months ago

Responsiveness

Inactive

Star History

7 stars in the last 30 days