valtec-tts  by tronghieuit

Vietnamese TTS and voice cloning

Created 2 months ago
281 stars

Top 92.9% on SourcePulse

Project Summary

Valtec Vietnamese TTS is an ultra-lightweight text-to-speech and zero-shot voice cloning system that runs entirely on CPU, aimed at engineers and power users. It synthesizes and clones voices without requiring a GPU, at speeds several times faster than real time.

How It Works

The system uses a compact architecture (about 74.8M parameters for the zero-shot model) designed to run natively on CPU. Speaker and style encoders capture voice identity and prosody from a short reference clip (3-10 seconds), which enables both zero-shot cloning and prosody transfer. The Real-Time Factor (RTF), the ratio of synthesis time to the duration of the generated audio, is reported as below 0.3 on standard processors, so faster-than-real-time synthesis does not depend on GPU hardware.
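To make the RTF figure concrete, here is a minimal sketch of how the metric is usually computed and how it maps to a speedup over real time. The function names and the timing values are illustrative assumptions, not part of the valtec-tts package.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(rtf: float) -> float:
    """An RTF below 1.0 means faster than real time; the speedup is 1 / RTF."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Hypothetical measurement: 2.36 s of compute to produce 10 s of speech.
rtf = real_time_factor(synthesis_seconds=2.36, audio_seconds=10.0)
print(f"RTF = {rtf:.3f}, speedup = {speedup_over_real_time(rtf):.2f}x")
```

Under this definition, the "below 0.3" threshold corresponds to synthesis that is more than about 3.3 times faster than playback.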

Quick Start & Requirements

Install via pip:

    pip install git+https://github.com/tronghieuit/valtec-tts.git

Requirements are Python 3.8+ and PyTorch 2.0+. CUDA is optional and accelerates multi-speaker TTS; core functionality runs on CPU. Linux is recommended for optimal phonemization. Models download automatically from Hugging Face.
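Before installing, it can help to confirm that the environment meets the stated minimums. The following is a rough sketch that checks the Python and PyTorch versions using only the standard library; the helper names are hypothetical and the check is not something the project itself provides.

```python
import re
import sys


def parse_major_minor(version: str):
    """Extract (major, minor) from a version string such as '2.1.0+cu118'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def check_requirements(min_python=(3, 8), min_torch=(2, 0)):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info[:2] < min_python:
        problems.append(f"Python {min_python[0]}.{min_python[1]}+ is required")
        return problems  # importlib.metadata below needs Python 3.8+

    from importlib import metadata

    try:
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        problems.append("PyTorch is not installed")
    else:
        if parse_major_minor(torch_version) < min_torch:
            problems.append(
                f"PyTorch {min_torch[0]}.{min_torch[1]}+ is required, "
                f"found {torch_version}"
            )
    return problems


for problem in check_requirements():
    print("Requirement not met:", problem)
```

The check reads installed package metadata rather than importing torch, so it stays fast and does not fail when PyTorch is absent.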

Highlighted Details

  • Ultra-lightweight: 74.8M parameters for zero-shot cloning (~285MB FP32).
  • CPU-only inference achieves an RTF as low as 0.236 (roughly 4.2x faster than real time) for zero-shot tasks.
  • Zero-shot voice cloning requires minimal reference audio (3-10 seconds) without fine-tuning.
  • Supports prosody transfer, replicating intonation, rhythm, and emotion from reference audio.
  • Features a dedicated Vietnamese phonemizer (Northern/Southern) and 5 built-in multi-speaker TTS voices.
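The parameter count and the quoted FP32 size in the list above are consistent with each other, as the short calculation below shows. Only the FP32 figure comes from the source; the FP16 and INT8 rows are hypothetical and simply show what the same weights would occupy at other precisions.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_size_mib(num_params: int, dtype: str = "fp32") -> float:
    """Approximate size of the raw weights in MiB (1 MiB = 1024**2 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 2)


ZERO_SHOT_PARAMS = 74_800_000  # ~74.8M parameters, as stated in the summary

for dtype in ("fp32", "fp16", "int8"):
    size = weight_size_mib(ZERO_SHOT_PARAMS, dtype)
    print(f"{dtype}: {size:.1f} MiB")
```

At 4 bytes per parameter the weights come to roughly 285 MiB, matching the "~285MB FP32" figure; actual checkpoint files may differ slightly because of metadata and non-parameter buffers.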

Maintenance & Community

Developed by the ValtecAI Team. Specific community channels, notable contributors, sponsorships, or partnerships are not detailed in the README.

Licensing & Compatibility

Licensed under CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International). This license strictly prohibits commercial use without explicit written permission, limiting applications to non-commercial projects and research.

Limitations & Caveats

The model is optimized for Vietnamese, so output quality in other languages may be lower. Cloned voice fidelity depends on the quality of the reference audio, and highly distinctive voices may not be replicated perfectly. The system is not yet optimized for real-time streaming.
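Since cloning quality depends on the reference audio, a simple pre-flight check on clip length can catch obviously unsuitable inputs. The sketch below uses Python's standard wave module and treats the 3-10 second range quoted in the summary as a guideline; the function names are hypothetical and the check does not assess audio quality itself.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a WAV file, computed as frame count / sample rate."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def is_usable_reference(path: str, min_seconds: float = 3.0,
                        max_seconds: float = 10.0) -> bool:
    """True if the clip length falls inside the suggested reference range."""
    return min_seconds <= wav_duration_seconds(path) <= max_seconds


def write_silence(path: str, seconds: int, sample_rate: int = 16000) -> None:
    """Demo helper only: write a mono 16-bit WAV containing silence."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sample_rate)
        wf.writeframes(b"\x00\x00" * sample_rate * seconds)


# Demo with a synthetic 5-second clip in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "reference.wav")
    write_silence(demo_path, seconds=5)
    print("usable reference:", is_usable_reference(demo_path))
```

A length check like this says nothing about noise, clipping, or background speech, which are likely to matter at least as much for cloning fidelity.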

Health Check

Last Commit: 2 weeks ago
Responsiveness: Inactive
Pull Requests (30d): 1
Issues (30d): 1
Star History: 82 stars in the last 30 days
