Pytorch implementation of a multi-speaker text-to-speech attention network
Top 95.8% on sourcepulse
Spear-TTS is a PyTorch implementation of a multi-speaker text-to-speech attention network, designed for high-fidelity speech generation with minimal supervision. It is particularly relevant for researchers and developers working on advanced speech synthesis systems, such as SoundStorm, and those interested in leveraging techniques like backtranslation and speculative decoding for improved efficiency and quality.
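As a rough illustration of the speculative decoding idea mentioned above (not the library's actual implementation), a greedy variant can be sketched in plain Python; `target_next` and `draft_next` are hypothetical stand-ins for the full model and a cheaper draft model:

```python
def speculative_decode(target_next, draft_next, prefix, num_draft=4, max_len=12):
    # Greedy sketch: the cheap draft model proposes a run of tokens, and the
    # expensive target model verifies them, keeping the longest agreeing prefix,
    # so a single verification pass can accept several tokens at once.
    seq = list(prefix)
    while len(seq) < max_len:
        # draft proposes num_draft tokens autoregressively
        ctx, proposed = list(seq), []
        for _ in range(num_draft):
            tok = draft_next(ctx)
            proposed.append(tok)
            ctx.append(tok)
        # target verifies the proposals, accepting until the first disagreement
        for tok in proposed:
            if len(seq) >= max_len:
                break
            if target_next(seq) == tok:
                seq.append(tok)
            else:
                seq.append(target_next(seq))  # target's token replaces the rejected draft
                break
    return seq[:max_len]
```

When draft and target agree often, most tokens are accepted in batches, which is where the decoding speedup comes from; with identical models the output matches plain greedy decoding.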
How It Works
The core of Spear-TTS is a Text-to-Semantic transformer model that converts text into semantic representations. It utilizes a pre-trained HubertWithKmeans model for speech feature extraction. The architecture incorporates grouped query attention for memory-efficient decoding and supports optimizations like FlashAttention. The project also includes a SemanticToTextDatasetGenerator for creating pseudo-labeled datasets via backtranslation, enabling a multi-stage training process that moves from speech-to-speech reconstruction to fine-tuning on text-to-speech.
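The memory saving from grouped query attention comes from several query heads sharing one key/value head, which shrinks the K/V cache during decoding. A minimal NumPy sketch of the mechanism (illustrative only, not the project's implementation):

```python
import numpy as np

def grouped_query_attention(q, k, v, num_kv_heads):
    # q: (heads, seq, dim); k, v: (num_kv_heads, seq, dim).
    # Each group of heads // num_kv_heads query heads attends against the same
    # shared K/V head, so fewer K/V tensors must be stored per decoding step.
    heads, seq, dim = q.shape
    group = heads // num_kv_heads
    out = np.empty_like(q)
    for h in range(heads):
        kv = h // group  # index of the shared K/V head for this query head
        scores = q[h] @ k[kv].T / np.sqrt(dim)
        scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
        attn = np.exp(scores)
        attn /= attn.sum(axis=-1, keepdims=True)
        out[h] = attn @ v[kv]
    return out
```

Setting num_kv_heads equal to the number of query heads recovers standard multi-head attention; num_kv_heads = 1 recovers multi-query attention.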
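The backtranslation step described above can be sketched generically: a semantic-to-text model trained on a small paired corpus labels large amounts of unpaired speech, and the resulting pairs train the text-to-semantic direction. The function and argument names below are hypothetical, not the SemanticToTextDatasetGenerator API:

```python
def backtranslate_dataset(unlabeled_semantic, semantic_to_text):
    # Pseudo-labeling via backtranslation: each unpaired speech sample
    # (a sequence of semantic tokens) is labeled with generated text,
    # producing (pseudo_text, semantic_tokens) training pairs.
    pairs = []
    for semantic_tokens in unlabeled_semantic:
        pseudo_text = semantic_to_text(semantic_tokens)  # backtranslation step
        pairs.append((pseudo_text, semantic_tokens))
    return pairs
```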
Quick Start & Requirements
pip install spear-tts-pytorch
You will also need the pre-trained HuBERT checkpoint (hubert_base_ls960.pt) and Kmeans model weights (hubert_base_ls960_L9_km500.bin).
Highlighted Details
Maintenance & Community
The project acknowledges contributions from Lucas Newman for specific implementation details. Sponsorships from Stability AI are noted. Further community engagement channels are not explicitly listed in the README.
Licensing & Compatibility
The project's licensing is not explicitly stated in the provided README. Compatibility for commercial use or closed-source linking is therefore not specified.
Limitations & Caveats
The README lists several "todo" items, including polishing the audio-text generation workflow and concatenating real audio-text datasets with generated ones. Some advanced features, such as cached key/values for the starter and specialized causal masks for FlashAttention, are still pending.
Last updated: 1 year ago. Status: Inactive.