soundstorm-pytorch by lucidrains

PyTorch implementation of SoundStorm for efficient parallel audio generation

created 2 years ago
1,523 stars

Top 27.7% on sourcepulse

Project Summary

This repository provides a PyTorch implementation of SoundStorm, Google DeepMind's efficient parallel audio generation model. It targets researchers and developers working on speech synthesis and audio generation who need fast, high-quality output.

How It Works

SoundStorm applies a masked generative transformer (inspired by MaskGIT) to the residual vector quantized (RVQ) codes produced by the SoundStream model. A Conformer, an architecture well suited to audio, predicts the masked codes in parallel, working through the RVQ levels from coarse to fine. Because many tokens are filled in at each step instead of one at a time, generation needs far fewer forward passes than autoregressive decoding over the same discrete audio tokens.
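To make the term "RVQ codes" concrete, here is a minimal pure-Python sketch of residual vector quantization. It is not SoundStream's implementation: the function names (`nearest`, `rvq_encode`, `rvq_decode`) and the toy codebook values are assumptions chosen for illustration. Each quantizer level encodes whatever error the previous level left behind, so one audio frame becomes one code index per level.

```python
def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def rvq_encode(codebooks, vec):
    """Quantize vec with a stack of codebooks; each level codes the residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry; the next level sees only the leftover error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codebooks, codes):
    """Reconstruct a vector by summing the selected entry from every level."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Hypothetical toy codebooks: a coarse level followed by a fine level.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]
frame = [1.2, 1.0]
codes = rvq_encode(codebooks, frame)   # one index per quantizer level
recon = rvq_decode(codebooks, codes)   # coarse entry plus fine correction
print(codes, recon)
```

A sequence of frames therefore yields a grid of integers with shape (sequence length, number of quantizers), and that grid is what the masked transformer learns to fill in.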

Quick Start & Requirements

  • Install: pip install soundstorm-pytorch
  • Prerequisites: PyTorch, Accelerate, Einops. GPU with CUDA is recommended for training and generation.
  • Usage examples and integration with SoundStream and Text-to-Semantic models are provided in the README.

Highlighted Details

  • Implements SoundStorm, an efficient parallel audio generation model.
  • Integrates with SoundStream for audio tokenization and, optionally, with a Text-to-Semantic model for text-to-speech.
  • Utilizes a Conformer architecture with rotary positional embeddings.
  • Supports generation in a specified number of steps (e.g., 18) with a cosine masking schedule.
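The fixed step count and cosine schedule mentioned in the last bullet can be illustrated with a short MaskGIT-style sketch. This is not the repository's code: `cosine_mask_counts`, `toy_predict`, `parallel_decode`, and the `MASK` sentinel are hypothetical names, a random predictor stands in for the Conformer, and a single token stream is decoded instead of one stream per RVQ level. The schedule decides how many positions remain masked after each step; the most confident predictions are committed first.

```python
import math
import random

MASK = -1  # sentinel for a position that has not been decoded yet

def cosine_mask_counts(seq_len, steps):
    """Number of tokens still masked after each of `steps` decoding steps."""
    counts = []
    for step in range(1, steps + 1):
        ratio = math.cos(math.pi / 2 * step / steps)  # falls from ~1 to 0
        counts.append(int(seq_len * ratio))
    counts[-1] = 0  # guard against floating-point residue on the final step
    return counts

def toy_predict(tokens, codebook_size, rng):
    """Stand-in for the model: a random (token, confidence) pair per position."""
    return [(rng.randrange(codebook_size), rng.random()) for _ in tokens]

def parallel_decode(seq_len, steps, codebook_size, seed=0):
    """Fill a fully masked sequence in `steps` passes, most confident first."""
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    for keep_masked in cosine_mask_counts(seq_len, steps):
        preds = toy_predict(tokens, codebook_size, rng)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Rank the still-masked positions by confidence, highest first.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        n_commit = len(masked) - keep_masked
        for i in masked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens

counts = cosine_mask_counts(1024, 18)
decoded = parallel_decode(seq_len=32, steps=18, codebook_size=1024)
print(counts)
```

With a cosine curve only a few tokens are committed in the early steps, when the model has little context, and progressively more are committed as the sequence fills in.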

Maintenance & Community

The project acknowledges contributions from Lucas Newman, Steven Hillis, and Jiang-Stan. It is sponsored by Hugging Face and Stability AI.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The text-to-speech functionality is a work in progress: the TextToSemantic model architecture is complete, but the pretraining and pseudo-labeling logic is not. Some features, such as returning audio files as a list and a command-line tool, are still on the to-do list.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 35 stars in the last 90 days
