PyTorch code for parallel audio generation
This repository provides an unofficial PyTorch implementation of Google's SoundStorm, a system for efficient parallel audio generation. It is targeted at researchers and developers working on text-to-speech (TTS) and audio synthesis, offering a parallel generation approach that significantly speeds up audio creation compared to sequential methods.
How It Works
The implementation uses a mask-based discrete diffusion model to predict acoustic tokens in parallel, conditioned on semantic tokens extracted by HuBERT. Unlike the original SoundStorm which uses a sum operation to combine multiple codebooks, this version employs a shallow U-Net for codebook combination. The audio codec used is the open-source AcademiCodec.
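The iterative, confidence-based unmasking that makes this kind of parallel generation work is the MaskGIT-style decoding scheme SoundStorm builds on. The sketch below illustrates the idea only; the mask sentinel, codebook size, cosine schedule, and the random stand-in for the conditional network are illustrative assumptions, not the repository's actual settings:

```python
import numpy as np

rng = np.random.default_rng(0)
MASK = -1   # sentinel for a masked position (illustrative)
V = 256     # hypothetical codebook size
T = 16      # number of acoustic-token frames to generate

def toy_model(tokens):
    """Stand-in for the conditional network; returns random logits.
    In SoundStorm this would be conditioned on HuBERT semantic tokens."""
    return rng.standard_normal((len(tokens), V))

def parallel_decode(T, steps=4):
    tokens = np.full(T, MASK)
    for s in range(steps):
        logits = toy_model(tokens)
        probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
        pred = probs.argmax(-1)      # greedy prediction per position
        conf = probs.max(-1)         # confidence per position
        masked = tokens == MASK
        tokens = np.where(masked, pred, tokens)  # fill masked slots in parallel
        conf[~masked] = np.inf       # never re-mask already-committed tokens
        # cosine schedule: how many positions stay masked for the next step
        n_mask = int(np.floor(T * np.cos(np.pi / 2 * (s + 1) / steps)))
        if n_mask > 0:
            tokens[np.argsort(conf)[:n_mask]] = MASK  # re-mask least confident
    return tokens

out = parallel_decode(16)  # all 16 tokens decoded in 4 forward passes
```

Each step predicts every masked token at once and commits only the most confident ones, so the sequence is produced in a handful of forward passes rather than one token at a time.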
Quick Start & Requirements
Training is launched with the start/start.sh script. Inference is initiated via python generate_samples_batch.py after modifying the paths inside that script. Dataset preparation follows the data_sample instructions. Specific hardware or software versions (e.g., CUDA, Python versions) are not explicitly detailed, but are implied by the PyTorch and audio-processing dependencies.
Maintenance & Community
The project is marked as "wip" (work in progress). The primary contributor is yangdongchao. No specific community channels (Discord, Slack) or roadmap are mentioned in the README.
Licensing & Compatibility
The README does not explicitly state a license. Given the unofficial nature and lack of explicit licensing, commercial use or linking with closed-source projects should be approached with caution until a license is clarified.
Limitations & Caveats
This is an unofficial implementation and is marked as "wip," indicating potential instability or incomplete features. The README mentions a planned second version based on MASKGIT, suggesting the current version might not fully align with the latest SoundStorm developments. Specific requirements for dataset preparation and computational resources are not fully detailed.