spear-tts-pytorch by lucidrains

PyTorch implementation of a multi-speaker text-to-speech attention network

created 2 years ago
271 stars

Top 95.8% on sourcepulse

Project Summary

Spear-TTS is a PyTorch implementation of a multi-speaker text-to-speech attention network, designed for high-fidelity speech generation with minimal supervision. It is particularly relevant for researchers and developers building advanced speech synthesis systems such as SoundStorm, and for anyone interested in techniques like backtranslation and speculative decoding for improved efficiency and quality.

How It Works

The core of Spear-TTS is a Text-to-Semantic transformer model that converts text into semantic representations. It utilizes a pre-trained HubertWithKmeans model for speech feature extraction. The architecture incorporates grouped query attention for memory-efficient decoding and supports optimizations like FlashAttention. The project also includes a SemanticToTextDatasetGenerator for creating pseudo-labeled datasets via backtranslation, enabling a multi-stage training process from speech-to-speech reconstruction to fine-tuning on text-to-speech.
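
Below is a minimal sketch of how these pieces fit together, loosely following the shape of the README's usage example; argument names, depths, and the generate signature are illustrative assumptions and may not match the current API exactly.

    import torch
    from audiolm_pytorch import HubertWithKmeans
    from spear_tts_pytorch import TextToSemantic

    # pretrained HuBERT checkpoint + k-means quantizer supply the semantic tokens
    wav2vec = HubertWithKmeans(
        checkpoint_path = './hubert_base_ls960.pt',
        kmeans_path = './hubert_base_ls960_L9_km500.bin'
    )

    # text-to-semantic transformer; using fewer key/value heads than query heads
    # gives grouped query attention for memory-efficient decoding
    model = TextToSemantic(
        wav2vec = wav2vec,
        dim = 512,
        num_text_token_ids = 256,
        heads = 8,
        target_kv_heads = 2,   # grouped query attention (illustrative value)
        source_depth = 6,
        target_depth = 6
    )

    # text token ids in, semantic token ids out (e.g. to condition SoundStorm)
    text_ids = torch.randint(0, 256, (1, 120))

    semantic_ids = model.generate(
        source = text_ids,          # argument names assumed from the README's example
        source_type = 'text',
        target_type = 'speech',
        max_length = 256
    )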

Quick Start & Requirements

  • Install: pip install spear-tts-pytorch
  • Prerequisites: Requires pre-trained Hubert model weights (hubert_base_ls960.pt) and Kmeans model weights (hubert_base_ls960_L9_km500.bin).
  • Usage examples and dataset generation are covered in the README; a hedged sketch of the dataset-generation step follows this list.
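
The dataset-generation (backtranslation) step roughly follows the pattern below: SemanticToTextDatasetGenerator wraps a trained TextToSemantic model and writes pseudo text labels for an audio dataset to disk. The dummy dataset class and output folder are placeholders, and `model` is assumed to be the TextToSemantic instance from the earlier sketch.

    import torch
    from torch.utils.data import Dataset
    from spear_tts_pytorch import SemanticToTextDatasetGenerator

    class DummyAudioDataset(Dataset):
        # stand-in for a real corpus; each item is one second of mono audio at 16 kHz
        def __len__(self):
            return 4

        def __getitem__(self, idx):
            return torch.randn(16000)

    # `model` is a trained TextToSemantic (see the sketch in "How It Works")
    generator = SemanticToTextDatasetGenerator(
        model = model,
        dataset = DummyAudioDataset(),
        folder = './generated_audio_text_pairs'   # illustrative output path
    )

    # writes pseudo text labels for each audio item, to be used later
    # when fine-tuning on text-to-speech
    generator(max_length = 512)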

Highlighted Details

  • Implements Spear-TTS, a multi-speaker text-to-speech attention network.
  • Supports backtranslation and beam search decoding for pseudo-dataset generation.
  • Integrates speculative decoding for faster inference (a conceptual sketch follows this list).
  • Leverages grouped query attention and FlashAttention for efficiency.
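
As background on the speculative decoding item above, the sketch below shows the general draft-and-verify pattern with generic PyTorch models. It is a conceptual illustration only, not this library's API: a small draft model proposes a few tokens, and the large target model verifies them in a single forward pass.

    import torch

    @torch.no_grad()
    def speculative_step(draft_model, target_model, prefix, gamma = 4):
        # draft_model and target_model return (batch, seq, vocab) logits;
        # greedy variant for clarity, assumes batch size 1 for the equality check
        seq = prefix
        proposed = []
        for _ in range(gamma):
            next_token = draft_model(seq)[:, -1].argmax(dim = -1, keepdim = True)
            proposed.append(next_token)
            seq = torch.cat((seq, next_token), dim = -1)

        # single verification pass with the large model
        target_logits = target_model(seq)

        accepted = prefix
        for i, token in enumerate(proposed):
            pos = prefix.shape[-1] + i - 1   # logits at pos predict position pos + 1
            target_choice = target_logits[:, pos].argmax(dim = -1, keepdim = True)
            if torch.equal(target_choice, token):
                accepted = torch.cat((accepted, token), dim = -1)
            else:
                # first disagreement: keep the target model's token and stop
                accepted = torch.cat((accepted, target_choice), dim = -1)
                break
        return accepted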

Maintenance & Community

The README acknowledges Lucas Newman for specific implementation contributions and notes sponsorship from Stability AI. No further community engagement channels are listed.

Licensing & Compatibility

The project's licensing is not explicitly stated in the provided README. Compatibility for commercial use or closed-source linking is therefore not specified.

Limitations & Caveats

The README lists several outstanding "todo" items, including polishing the audio-text generation workflow and concatenating real audio-text datasets with the generated ones. Some advanced features, such as cached key/values for the starter and specialized causal masks for FlashAttention, remain pending.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 90 days
