nix-tts by rendchevi

Lightweight TTS via module-wise distillation (research paper code)

Created 3 years ago

262 stars

Top 97.1% on SourcePulse

Project Summary

Nix-TTS is a lightweight, end-to-end text-to-speech (TTS) model obtained by distilling knowledge from a larger, high-quality teacher model. It targets researchers and developers who need efficient TTS on resource-constrained devices, cutting parameters by up to 89.34% and speeding up inference by 3.04x (Intel i7 CPU) to 8.36x (Raspberry Pi 3B) while maintaining reasonable voice quality.

How It Works

Nix-TTS employs module-wise knowledge distillation, a technique that allows for flexible and independent transfer of learned representations from a teacher model to specific components (encoder and decoder) of the student model. This approach enables the student model to inherit the non-autoregressive and vocoder-free characteristics of the teacher, resulting in a compact yet performant TTS system.
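The idea of distilling each module independently can be illustrated with a toy sketch. This is not the Nix-TTS training code: the one-weight "modules", the `teacher_encoder` and `teacher_decoder` functions, and the use of mean squared error as the distillation loss are all assumptions chosen to keep the example small and runnable.

```python
# Toy illustration of module-wise distillation (hypothetical, not the Nix-TTS code).
# Each module is a scalar linear map y = w * x; MSE stands in for the real losses.

def distill_module(teacher_fn, inputs, steps=200, lr=0.05):
    """Fit a one-weight student y = w * x to a frozen teacher module's outputs."""
    targets = [teacher_fn(x) for x in inputs]  # teacher is frozen; only w is updated
    w = 0.0
    for _ in range(steps):
        # Gradient of mean((w*x - t)^2) with respect to w.
        grad = sum(2 * x * (w * x - t) for x, t in zip(inputs, targets)) / len(inputs)
        w -= lr * grad
    return w

# Hypothetical frozen teacher modules.
def teacher_encoder(x):
    return 2.0 * x    # text features -> latent

def teacher_decoder(z):
    return -0.5 * z   # latent -> waveform sample

text_feats = [0.5, 1.0, 1.5, 2.0]
latents = [teacher_encoder(x) for x in text_feats]

# Each student module is trained on its own, against its teacher counterpart only.
w_enc = distill_module(teacher_encoder, text_feats)
w_dec = distill_module(teacher_decoder, latents)

def student_tts(x):
    """Chain the two independently distilled student modules end to end."""
    return w_dec * (w_enc * x)

print(w_enc, w_dec, student_tts(1.0))
```

Because the decoder student is fitted on the teacher's latents rather than on the encoder student's outputs, either module could be redesigned or retrained without touching the other, which is the flexibility the paragraph above describes.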

Quick Start & Requirements

Install dependencies: pip install -r requirements.txt
Install espeak: sudo apt-get install espeak
Download pre-trained models from the provided link.
Official Demo: 🤗 Interactive Demo
Audio Samples: 📢 Audio Samples
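Before running the steps above, a small pre-flight check can confirm that the prerequisites appear to be in place. The sketch below is an illustration only: the `preflight` helper is hypothetical and not part of the repository, and it assumes `espeak` is expected on the `PATH` and that `requirements.txt` sits in the repository root.

```python
import shutil
from pathlib import Path

def preflight(repo_dir="."):
    """Report whether the Quick Start prerequisites appear to be in place."""
    return {
        # shutil.which returns None when the executable is not on PATH.
        "espeak_on_path": shutil.which("espeak") is not None,
        # Assumed location of the dependency list: the repository root.
        "requirements_txt_found": (Path(repo_dir) / "requirements.txt").is_file(),
    }

if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```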

Highlighted Details

Achieves 5.23M parameters, up to an 89.34% reduction compared to the teacher model.
Offers inference speedups over the teacher model of 3.04x on an Intel i7 CPU and 8.36x on a Raspberry Pi 3B.
Retains non-autoregressive and end-to-end (vocoder-free) properties.
Module-wise distillation allows for flexible student model design.
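The parameter figures above can be turned into an implied teacher size with a short calculation. This is a back-of-the-envelope sketch: it assumes the reduction is defined as 1 - student / teacher, and the resulting teacher size is derived from the quoted numbers, not stated in the source.

```python
# Figures quoted in the list above.
student_params_m = 5.23   # student size, in millions of parameters
reduction = 0.8934        # fractional reduction versus the teacher ("up to")

# Assumption: reduction = 1 - student / teacher, so teacher = student / (1 - reduction).
implied_teacher_params_m = student_params_m / (1.0 - reduction)
compression_ratio = implied_teacher_params_m / student_params_m

print(f"implied teacher size: {implied_teacher_params_m:.1f}M parameters")
print(f"compression ratio:    {compression_ratio:.1f}x")
```

Under that assumption the teacher would have about 49M parameters, making the student roughly 9.4x smaller. Since the source says "up to", this is an upper bound on the comparison rather than an exact teacher size.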

Maintenance & Community

Research funded by Kata.ai; the authors are affiliated with Kata.ai.
Adapted components from VITS and Comprehensive-Transformer-TTS.
Paper Link: 📄 Paper Link

Licensing & Compatibility

The repository does not explicitly state a license. The README mentions funding by Kata.ai, which may imply proprietary restrictions, though none are stated. Compatibility with commercial use or closed-source linking is not specified.

Limitations & Caveats

The missing license may restrict commercial use. The 8.36x speedup on Raspberry Pi 3B is a relative figure; the provided table indicates that inference on that device is still slower than real time (0.50x). Naturalness and intelligibility are described as "fair" compared to the teacher model, suggesting a quality trade-off for size and speed.
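A relative speedup and a slower-than-real-time result are not contradictory, as a short calculation shows. This sketch assumes that "0.50x" means audio is generated at half real-time speed and that the 8.36x speedup is measured against the teacher on the same device; the teacher values it derives are implied by those two numbers and are not reported in the source.

```python
def real_time_factor(speed_vs_real_time):
    """Compute seconds needed per second of audio; below 1.0 is faster than real time."""
    return 1.0 / speed_vs_real_time

student_speed = 0.50  # from the text: 0.50x real time on Raspberry Pi 3B (assumed meaning)
speedup = 8.36        # from the text: speedup on Raspberry Pi 3B

student_rtf = real_time_factor(student_speed)

# Assumption: speedup = student speed / teacher speed on the same device.
implied_teacher_speed = student_speed / speedup
implied_teacher_rtf = real_time_factor(implied_teacher_speed)

print(f"student RTF:         {student_rtf:.2f}")
print(f"implied teacher RTF: {implied_teacher_rtf:.2f}")
```

Under these assumptions the student needs about 2 seconds of compute per second of audio, while the teacher would need nearly 17 seconds, so the student is much faster than the teacher yet still not real-time on this device.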

Health Check

Last Commit: 3 months ago
Responsiveness: Inactive

Star History

3 stars in the last 30 days