SoundStorm by yangdongchao

PyTorch code for parallel audio generation

created 2 years ago
267 stars

Top 96.7% on sourcepulse

Project Summary

This repository provides an unofficial PyTorch implementation of Google's SoundStorm, a system for efficient parallel audio generation. It targets researchers and developers working on text-to-speech (TTS) and audio synthesis. Because acoustic tokens are predicted in parallel rather than one at a time, generation is intended to be significantly faster than with sequential (autoregressive) methods.

How It Works

The implementation uses a mask-based discrete diffusion model to predict acoustic tokens in parallel, conditioned on semantic tokens extracted by HuBERT. Unlike the original SoundStorm, which combines multiple codebooks with a sum operation, this version combines them with a shallow U-Net. The audio codec is the open-source AcademiCodec.
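
To make the idea of parallel masked token prediction concrete, here is a minimal sketch in plain Python. It is not the repository's sampler: it swaps in confidence-based iterative unmasking (the MaskGIT-style scheme) because that is the simplest way to show many positions being filled per step. The names `MASK`, `toy_predictor`, and `parallel_decode`, the cosine schedule, and the single codebook level are all assumptions made for illustration; the predictor is a deterministic stand-in for a trained network.

```python
import math
import random

MASK = -1  # hypothetical sentinel for a not-yet-generated acoustic token


def toy_predictor(semantic, acoustic, vocab_size, rng):
    """Stand-in for a trained network: one (token, confidence) guess per position.

    A real model would condition on the semantic tokens and on the acoustic
    tokens revealed so far; this toy version only looks at the semantic token.
    """
    guesses = []
    for sem in semantic:
        token = (sem * 7 + 3) % vocab_size  # deterministic fake prediction
        confidence = rng.random()           # fake confidence in [0, 1)
        guesses.append((token, confidence))
    return guesses


def parallel_decode(semantic, vocab_size=1024, steps=4, seed=0):
    """Fill every masked position within a fixed number of parallel steps."""
    rng = random.Random(seed)
    acoustic = [MASK] * len(semantic)
    for step in range(1, steps + 1):
        masked = [i for i, tok in enumerate(acoustic) if tok == MASK]
        if not masked:
            break
        guesses = toy_predictor(semantic, acoustic, vocab_size, rng)
        # Cosine schedule: how many positions should stay masked after this step.
        keep_masked = math.floor(len(semantic) * math.cos(math.pi / 2 * step / steps))
        n_reveal = max(1, len(masked) - keep_masked)
        # Commit the most confident guesses among the still-masked positions.
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in masked[:n_reveal]:
            acoustic[i] = guesses[i][0]
    return acoustic
```

Calling `parallel_decode(semantic_tokens, steps=4)` returns a fully unmasked sequence after at most four passes, regardless of sequence length, whereas a sequential decoder would need one pass per token.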

Quick Start & Requirements

  • Install/Run: For training, run the provided start/start.sh script. For inference, first edit generate_samples_batch.py, then run python generate_samples_batch.py.
  • Prerequisites: A dataset prepared according to the data_sample instructions. Required CUDA and Python versions are not stated; the project depends on PyTorch and audio-processing libraries.
  • Resources: Dataset preparation is a key step. Model training and inference will likely require significant computational resources, including GPUs.
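
Before launching training or inference, it can help to confirm that the files named in the steps above are present in a local checkout. The following is a small sketch under stated assumptions: `missing_paths` and `EXPECTED_PATHS` are hypothetical helpers, not part of the repository, and treating `data_sample` as a path at the repository root is a guess based on the wording above.

```python
from pathlib import Path

# Paths named in the quick-start steps; their exact layout in the
# repository is an assumption.
EXPECTED_PATHS = ("start/start.sh", "generate_samples_batch.py", "data_sample")


def missing_paths(repo_root, expected=EXPECTED_PATHS):
    """Return the expected paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).exists()]
```

An empty list from `missing_paths(".")` means all three entries were found; anything else lists what to look for before running the scripts.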

Highlighted Details

  • Unofficial PyTorch implementation of Google's SoundStorm.
  • Utilizes mask-based discrete diffusion for parallel acoustic token prediction.
  • Leverages HuBERT for semantic token extraction.
  • Employs a shallow U-Net for combining codebooks.
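
For context on the codebook point above, the sum-based combination attributed to the original SoundStorm can be sketched in a few lines of plain Python. This shows only the summation baseline; the shallow U-Net that this repository uses instead is not shown, because its architecture is not described here. The function names, the untrained random tables, and the toy sizes are all assumptions for illustration.

```python
import random


def make_embedding_table(vocab_size, dim, rng):
    """One embedding table: vocab_size rows of dim floats (random, untrained)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


def sum_codebook_embeddings(tokens, tables):
    """Combine per-codebook tokens into one vector per frame by summing embeddings.

    tokens: list of frames, each frame holding one token id per codebook level.
    tables: one embedding table per codebook level.
    """
    dim = len(tables[0][0])
    combined = []
    for frame in tokens:
        vec = [0.0] * dim
        for level, token in enumerate(frame):
            row = tables[level][token]
            vec = [a + b for a, b in zip(vec, row)]
        combined.append(vec)
    return combined


rng = random.Random(0)
num_codebooks, vocab_size, dim = 4, 8, 3  # toy sizes, not the repository's settings
tables = [make_embedding_table(vocab_size, dim, rng) for _ in range(num_codebooks)]
frames = [[0, 1, 2, 3], [7, 6, 5, 4]]     # 2 frames x 4 codebook levels
vectors = sum_codebook_embeddings(frames, tables)
print(len(vectors), len(vectors[0]))      # prints: 2 3
```

Summation gives every codebook level the same, fixed mixing rule; a learned combiner such as a small U-Net can instead learn how the levels should interact, at the cost of extra parameters.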

Maintenance & Community

The project is marked as "wip" (work in progress). The primary contributor is yangdongchao. No specific community channels (Discord, Slack) or roadmap are mentioned in the README.

Licensing & Compatibility

The README does not explicitly state a license. Given the unofficial nature and lack of explicit licensing, commercial use or linking with closed-source projects should be approached with caution until a license is clarified.

Limitations & Caveats

This is an unofficial implementation marked as "wip," so it may be unstable or incomplete. The README mentions a planned second version based on MaskGIT, which suggests the current version does not yet fully match the original SoundStorm design. Dataset preparation requirements and compute needs are not fully documented.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 90 days