valtec-tts  by tronghieuit

Vietnamese TTS and voice cloning

Created 3 months ago
310 stars

Top 86.8% on SourcePulse

Project Summary

Valtec Vietnamese TTS is a lightweight system for Vietnamese text-to-speech and zero-shot voice cloning that runs on CPU, aimed at engineers and power users. It synthesizes and clones voices without requiring a GPU and generates audio several times faster than real time.

How It Works

The system uses a compact architecture (~74.8M parameters for the zero-shot model) designed to run natively on CPU. Speaker and style encoders capture voice identity and prosody from a short reference sample (3-10 s). The Real-Time Factor (RTF), defined as processing time divided by the duration of the generated audio, stays below 0.3 on standard processors, so one second of speech takes under 0.3 s to synthesize. Because no GPU is needed, the hardware barrier to voice cloning and prosody transfer is low.
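To make the RTF figure concrete, the following is a minimal sketch of how it could be measured for any synthesis function. The `synthesize` callable, its return shape (samples plus sample rate), and the `fake_synthesize` stand-in are assumptions made for illustration; they are not taken from the valtec-tts API.

```python
import time
from typing import Callable, Sequence, Tuple

# Hypothetical signature: takes text, returns (audio samples, sample rate in Hz).
Synthesizer = Callable[[str], Tuple[Sequence[float], int]]


def real_time_factor(synthesize: Synthesizer, text: str) -> float:
    """Return processing time divided by the duration of the audio produced."""
    start = time.perf_counter()
    samples, sample_rate = synthesize(text)
    elapsed = time.perf_counter() - start

    audio_seconds = len(samples) / sample_rate
    if audio_seconds <= 0:
        raise ValueError("synthesize returned no audio")
    return elapsed / audio_seconds


def fake_synthesize(text: str) -> Tuple[Sequence[float], int]:
    """Stand-in for a real model: one second of silence at 22,050 Hz."""
    return [0.0] * 22050, 22050


if __name__ == "__main__":
    rtf = real_time_factor(fake_synthesize, "xin chào")
    print(f"RTF: {rtf:.4f} (values below 1.0 are faster than real time)")
```

Swapping `fake_synthesize` for a wrapper around an actual model call would give a measured RTF for a given machine; results vary with CPU, text length, and warm-up.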

Quick Start & Requirements

Installation is via pip: pip install git+https://github.com/tronghieuit/valtec-tts.git. Requirements include Python 3.8+ and PyTorch 2.0+. CUDA is optional for multi-speaker TTS acceleration; core functionality runs on CPU. Linux is recommended for optimal phonemization. Models auto-download from Hugging Face.
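Before installing, it can help to confirm that the environment meets the stated minimums (Python 3.8+ and PyTorch 2.0+). The sketch below is one way to do that with the standard library only; the helper names are hypothetical and are not part of the project.

```python
import re
import sys
from importlib import metadata
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as '2.1.0+cu118'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def installed_torch_version() -> Optional[str]:
    """Return the installed PyTorch version, or None if it is not installed."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


def meets_requirements(
    python_version: Tuple[int, int], torch_version: Optional[str]
) -> bool:
    """Check the minimums stated above: Python 3.8+ and PyTorch 2.0+."""
    if python_version < (3, 8):
        return False
    if torch_version is None:
        return False
    return parse_version(torch_version) >= (2, 0)


if __name__ == "__main__":
    ok = meets_requirements(sys.version_info[:2], installed_torch_version())
    print("Environment OK" if ok else "Python 3.8+ and PyTorch 2.0+ are required")
```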

Highlighted Details

  • Ultra-lightweight: 74.8M parameters for zero-shot cloning (~285 MB of FP32 weights).
  • CPU-only inference achieves an RTF as low as 0.236 (over 4x faster than real time) for zero-shot tasks.
  • Zero-shot voice cloning requires minimal reference audio (3-10 seconds) without fine-tuning.
  • Supports prosody transfer, replicating intonation, rhythm, and emotion from reference audio.
  • Features a dedicated Vietnamese phonemizer (Northern/Southern) and 5 built-in multi-speaker TTS voices.
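The size and speed figures in the list above follow from simple arithmetic, sketched below. The sketch assumes 4 bytes per FP32 parameter, counts weights only (no file-format overhead), and reads "MB" as mebibytes (2^20 bytes); that last reading is an assumption, since in decimal megabytes the same weights come to about 299 MB.

```python
BYTES_PER_FP32 = 4


def fp32_size_mib(num_parameters: float) -> float:
    """Size of the raw FP32 weights in MiB (2**20 bytes)."""
    return num_parameters * BYTES_PER_FP32 / 2**20


def speedup_from_rtf(rtf: float) -> float:
    """Seconds of audio generated per second of compute."""
    if rtf <= 0:
        raise ValueError("RTF must be positive")
    return 1.0 / rtf


if __name__ == "__main__":
    print(round(fp32_size_mib(74.8e6), 1))     # 285.3
    print(round(speedup_from_rtf(0.236), 2))   # 4.24
```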

Maintenance & Community

Developed by the ValtecAI Team. Specific community channels, notable contributors, sponsorships, or partnerships are not detailed in the README.

Licensing & Compatibility

Licensed under CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International). This license strictly prohibits commercial use without explicit written permission, limiting applications to non-commercial projects and research.

Limitations & Caveats

The model is optimized for Vietnamese; other languages may have lower quality. Cloned voice fidelity depends on the quality of the reference audio, and highly distinctive voices may not be replicated perfectly. The system is not yet optimized for real-time streaming.
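Since cloning quality depends on the reference audio, a simple pre-check on the clip can catch obvious problems early. The sketch below only verifies that a PCM WAV file falls within the 3-10 second range mentioned for reference samples; it uses Python's standard `wave` module, the helper names are hypothetical, and it does not assess noise or recording quality.

```python
import wave

MIN_SECONDS = 3.0   # lower bound of the stated reference length
MAX_SECONDS = 10.0  # upper bound of the stated reference length


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_reference_length_ok(path: str) -> bool:
    """True if the clip length is within the 3-10 second range."""
    return MIN_SECONDS <= wav_duration_seconds(path) <= MAX_SECONDS
```

Other formats (MP3, FLAC) would need a different reader; the check here is limited to what the standard library can open.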

Health Check

Last Commit: 6 days ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 1

Star History

24 stars in the last 30 days
