PyTorch code for parallel audio generation
This repository provides an unofficial PyTorch implementation of Google's SoundStorm, a system for efficient parallel audio generation. It is targeted at researchers and developers working on text-to-speech (TTS) and audio synthesis, offering a parallel generation approach that significantly speeds up audio creation compared to sequential methods.
How It Works
The implementation uses a mask-based discrete diffusion model to predict acoustic tokens in parallel, conditioned on semantic tokens extracted by HuBERT. Unlike the original SoundStorm which uses a sum operation to combine multiple codebooks, this version employs a shallow U-Net for codebook combination. The audio codec used is the open-source AcademiCodec.
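The iterative, confidence-based unmasking that makes this kind of parallel generation work is the MaskGIT-style decoding scheme SoundStorm builds on. The sketch below illustrates the idea only; the mask sentinel, codebook size, cosine schedule, and the random stand-in for the conditional network are illustrative assumptions, not the repository's actual settings:

```python
import numpy as np

rng = np.random.default_rng(0)
MASK = -1   # sentinel for a masked position (illustrative)
V = 256     # hypothetical codebook size
T = 16      # number of acoustic-token frames to generate

def toy_model(tokens):
    """Stand-in for the conditional network; returns random logits.
    In SoundStorm this would be conditioned on HuBERT semantic tokens."""
    return rng.standard_normal((len(tokens), V))

def parallel_decode(T, steps=4):
    tokens = np.full(T, MASK)
    for s in range(steps):
        logits = toy_model(tokens)
        probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
        pred = probs.argmax(-1)      # greedy prediction per position
        conf = probs.max(-1)         # confidence per position
        masked = tokens == MASK
        tokens = np.where(masked, pred, tokens)  # fill masked slots in parallel
        conf[~masked] = np.inf       # never re-mask already-committed tokens
        # cosine schedule: how many positions stay masked for the next step
        n_mask = int(np.floor(T * np.cos(np.pi / 2 * (s + 1) / steps)))
        if n_mask > 0:
            tokens[np.argsort(conf)[:n_mask]] = MASK  # re-mask least confident
    return tokens

out = parallel_decode(16)  # all 16 tokens decoded in 4 forward passes
```

Each step predicts every masked token at once and commits only the most confident ones, so the sequence is produced in a handful of forward passes rather than one token at a time.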
Quick Start & Requirements
Training is launched with the start/start.sh script. Inference is initiated via python generate_samples_batch.py after modifying the paths inside that script. Dataset preparation follows the data_sample instructions. Specific hardware or software versions (e.g., CUDA, Python versions) are not explicitly detailed, but are implied by the PyTorch and audio-processing dependencies.
Maintenance & Community
The project is marked as "wip" (work in progress). The primary contributor is yangdongchao. No specific community channels (Discord, Slack) or roadmap are mentioned in the README.
Licensing & Compatibility
The README does not explicitly state a license. Given the unofficial nature and lack of explicit licensing, commercial use or linking with closed-source projects should be approached with caution until a license is clarified.
Limitations & Caveats
This is an unofficial implementation and is marked as "wip," indicating potential instability or incomplete features. The README mentions a planned second version based on MASKGIT, suggesting the current version might not fully align with the latest SoundStorm developments. Specific requirements for dataset preparation and computational resources are not fully detailed.