LuxTTS by ysharma3501

TTS voice cloning for rapid, high-quality generation

Created 5 months ago

4,777 stars

Top 10.2% on SourcePulse

Project Summary

LuxTTS is a lightweight text-to-speech (TTS) model based on ZipVoice, designed for fast, high-quality voice cloning and realistic speech generation. It targets engineers and researchers who need efficient TTS: it is reported to generate speech more than 150x faster than real time, with state-of-the-art (SOTA) voice cloning and clear 48kHz audio output, while using about 1GB of VRAM.

How It Works

LuxTTS builds on the ZipVoice architecture, which it distills down to a 4-step process and pairs with an improved sampling technique. A key differentiator is its custom 48kHz vocoder, which produces clearer speech than the 24kHz output typical of many TTS models. The result is voice cloning quality described as comparable to significantly larger models, at inference speeds above 150x real time.
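To make the 48kHz versus 24kHz comparison concrete, the short sketch below computes the Nyquist limit (the highest frequency a sample rate can represent) and the number of samples per clip. This is standard sampling arithmetic, not LuxTTS code; the helper names `nyquist_hz` and `samples_for` are illustrative only.

```python
def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency that a given sample rate can represent."""
    return sample_rate_hz / 2


def samples_for(duration_s: float, sample_rate_hz: int) -> int:
    """Number of audio samples needed for a clip of the given duration."""
    return int(round(duration_s * sample_rate_hz))


for rate in (24_000, 48_000):
    print(
        f"{rate} Hz: content up to {nyquist_hz(rate):.0f} Hz, "
        f"{samples_for(1.0, rate)} samples per second of audio"
    )
```

Doubling the sample rate doubles both the representable bandwidth and the number of samples the vocoder has to produce for the same clip length.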

Quick Start & Requirements

Installation: Clone the repository (git clone https://github.com/ysharma3501/LuxTTS.git), navigate into the directory (cd LuxTTS), and install dependencies (pip install -r requirements.txt).
Prerequisites: Python environment, PyTorch (implied by device='cuda'), and audio processing libraries (e.g., librosa, soundfile from requirements.txt). CUDA is recommended for GPU acceleration.
Resource Footprint: Requires approximately 1GB of VRAM.
Links: GitHub repository (https://github.com/ysharma3501/LuxTTS); a Hugging Face Spaces demo is also mentioned.
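Before loading the model, it can help to check whether a suitable GPU is present. The sketch below is one possible way to do that with PyTorch, assuming the roughly 1GB VRAM figure above. The function `pick_device` and its `min_vram_gb` parameter are hypothetical helpers, not part of LuxTTS. It falls back to CPU when PyTorch or CUDA is missing, and it does not consider MPS, since the summary lists MPS support as not yet available.

```python
def pick_device(min_vram_gb: float = 1.0) -> str:
    """Return "cuda" if a CUDA GPU with enough total memory exists, else "cpu".

    Hypothetical helper: the 1 GB default mirrors the footprint quoted in
    this summary and is an assumption, not a measured requirement.
    """
    try:
        import torch  # imported lazily so the check also runs without PyTorch
    except ImportError:
        return "cpu"

    if torch.cuda.is_available():
        # total_memory is the card's capacity in bytes, not the free amount.
        total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
        if total_gb >= min_vram_gb:
            return "cuda"
    return "cpu"


print(f"Selected device: {pick_device()}")
```

The returned string could then be passed wherever the project expects a device argument such as device='cuda'. Because the check uses total rather than free memory, other processes on the same GPU can still cause out-of-memory errors.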

Highlighted Details

Achieves inference speeds exceeding 150x real-time on a single GPU.
Provides SOTA voice cloning performance, competitive with models ten times its size.
Generates clear, high-fidelity speech at 48kHz.
Efficiently fits within 1GB of VRAM, enabling local deployment on most GPUs.
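The "150x real-time" figure can be read as a ratio of audio duration to generation time. The sketch below works through that arithmetic; the function names are illustrative, and the 150x value is the claim quoted above rather than a measurement made here.

```python
def speedup(audio_seconds: float, wall_seconds: float) -> float:
    """Real-time speedup: seconds of audio produced per second of compute."""
    return audio_seconds / wall_seconds


def wall_time_at(audio_seconds: float, speedup_x: float) -> float:
    """Generation time implied by a given real-time speedup."""
    return audio_seconds / speedup_x


# Generation time implied by the quoted 150x figure.
for label, seconds in (("10 s clip", 10), ("1 hour of audio", 3600)):
    print(f"{label}: about {wall_time_at(seconds, 150):.3f} s to generate")
```

To check the claim on your own hardware, time a generation call and pass the clip length and elapsed time to `speedup`.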

Maintenance & Community

The project has released its core model and code, along with a Huggingface Spaces demo. Future roadmap items include support for MPS (Apple Silicon) and the release of code for float16 inference, which is expected to further increase speed. Direct contact is available via email: yatharthsharma350@gmail.com.

Licensing & Compatibility

The project is licensed under the Apache-2.0 license. This permissive license generally allows for commercial use and integration into closed-source projects without significant restrictions.

Limitations & Caveats

MPS support and optimized float16 inference are not yet implemented. The initial audio encoding step (encode_prompt) incurs a delay of about 10 seconds on first use, attributed to librosa initialization. The ref_duration parameter can be adjusted to trade inference speed against potential artifacts.
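A one-time startup cost like the one described for encode_prompt is usually handled by making a warm-up call at startup, so that later requests do not pay it. The sketch below illustrates the pattern with a stand-in class. `LazyEncoder` and `timed` are hypothetical, the 0.2 second sleep merely simulates initialization, and none of this reflects the actual LuxTTS API.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


class LazyEncoder:
    """Hypothetical stand-in for a component with costly one-time setup."""

    def __init__(self, init_cost_s: float = 0.2):
        self._init_cost_s = init_cost_s
        self._ready = False

    def encode(self, path: str) -> str:
        if not self._ready:
            time.sleep(self._init_cost_s)  # simulated one-time initialization
            self._ready = True
        return f"encoded:{path}"


encoder = LazyEncoder()
_, first = timed(encoder.encode, "reference.wav")   # pays the setup cost
_, second = timed(encoder.encode, "reference.wav")  # setup already done
print(f"first call: {first:.3f} s, second call: {second:.3f} s")
```

In a long-running service, making the first call during startup keeps the delay out of the first user-facing request.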

Health Check

Last Commit: 1 month ago
Responsiveness: Inactive
Star History: 637 stars in the last 30 days