soundstorm-pytorch  by lucidrains

PyTorch implementation of SoundStorm for efficient parallel audio generation

Created 2 years ago
1,534 stars

Top 27.1% on SourcePulse

Project Summary

This repository provides a PyTorch implementation of Google DeepMind's SoundStorm, an efficient parallel audio generation model. It is aimed at researchers and developers working on speech synthesis and audio generation who need fast, high-quality audio output.

How It Works

SoundStorm applies a masked generative transformer, in the style of MaskGIT, to the residual vector quantized (RVQ) codes produced by the SoundStream model. A Conformer, an architecture well suited to audio, predicts the masked codes in parallel, and generation proceeds over a small number of refinement steps instead of one token at a time. Modeling discrete audio tokens this way keeps generation efficient while preserving fidelity.
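To make the idea of RVQ codes concrete, here is a minimal sketch of residual quantization. It is a toy under loud assumptions: it quantizes a single scalar with three hand-written codebooks, whereas SoundStream quantizes learned vectors with learned codebooks. The names `nearest`, `rvq_encode`, and `rvq_decode` are hypothetical and are not part of soundstorm-pytorch.

```python
from typing import List, Sequence


def nearest(codebook: Sequence[float], x: float) -> int:
    """Index of the codebook entry closest to x."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - x))


def rvq_encode(x: float, codebooks: Sequence[Sequence[float]]) -> List[int]:
    """Quantize x stage by stage; each stage encodes the previous residual."""
    residual = x
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual -= codebook[idx]  # what is left for the next, finer stage
    return codes


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Sequence[float]]) -> float:
    """Reconstruct by summing the selected entry from every stage."""
    return sum(codebook[i] for codebook, i in zip(codebooks, codes))


# Toy codebooks (assumed for illustration): coarse, medium, fine.
codebooks = [
    [-1.0, 0.0, 1.0],
    [-0.3, 0.0, 0.3],
    [-0.1, 0.0, 0.1],
]

codes = rvq_encode(0.72, codebooks)
approx = rvq_decode(codes, codebooks)
print(codes, approx)
```

Each audio frame therefore becomes one code index per quantizer stage, and it is these grids of indices that the masked transformer learns to fill in.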

Quick Start & Requirements

  • Install: pip install soundstorm-pytorch
  • Prerequisites: PyTorch, Accelerate, Einops. GPU with CUDA is recommended for training and generation.
  • Usage examples and integration with SoundStream and Text-to-Semantic models are provided in the README.

Highlighted Details

  • Implements SoundStorm, an efficient parallel audio generation model.
  • Integrates with SoundStream for audio tokenization and optionally a Text-to-Semantic model for text-to-speech.
  • Utilizes a Conformer architecture with rotary positional embeddings.
  • Supports generation in a specified number of steps (e.g., 18) with a cosine masking schedule.
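The step count and cosine schedule mentioned above can be sketched as follows. This is an illustrative MaskGIT-style loop, not the repository's code: the network is replaced by a random stand-in, a single flat token sequence is used instead of per-quantizer RVQ levels, and the names `cosine_mask_counts`, `toy_decode`, and `MASK` are hypothetical.

```python
import math
import random
from typing import List

MASK = -1  # placeholder id for a position that has not been decided yet


def cosine_mask_counts(seq_len: int, steps: int) -> List[int]:
    """How many tokens remain masked after each step under a cosine schedule."""
    counts = []
    for step in range(1, steps + 1):
        frac = math.cos(math.pi / 2 * step / steps)
        counts.append(math.floor(frac * seq_len))
    counts[-1] = 0  # guarantee that the final step commits everything
    return counts


def toy_decode(seq_len: int, steps: int, codebook_size: int, seed: int = 0) -> List[int]:
    """Iteratively unmask a sequence, keeping the least confident positions masked."""
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    for keep_masked in cosine_mask_counts(seq_len, steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Stand-in for the model: a random token and a random confidence per position.
        proposals = {i: (rng.randrange(codebook_size), rng.random()) for i in masked}
        by_confidence = sorted(masked, key=lambda i: proposals[i][1], reverse=True)
        # Commit the most confident predictions; the rest are retried next step.
        for i in by_confidence[: len(masked) - keep_masked]:
            tokens[i] = proposals[i][0]
    return tokens


counts = cosine_mask_counts(seq_len=1024, steps=18)
generated = toy_decode(seq_len=1024, steps=18, codebook_size=1024)
print(counts[:3], counts[-1], len(generated))
```

Under this schedule the early steps commit only a few tokens and the later steps commit many, so the model fixes its most confident predictions first and fills in the bulk once more context is available.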

Maintenance & Community

The project acknowledges contributions from Lucas Newman, Steven Hillis, and Jiang-Stan. It is sponsored by Hugging Face and Stability AI.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The text-to-speech functionality is a work in progress: the TextToSemantic model architecture is complete, but the pretraining and pseudo-labeling logic is still missing. Some features, such as returning audio files as a list and a command-line tool, remain on the "todo" list.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 7 stars in the last 30 days
