valtec-tts by tronghieuit

Vietnamese TTS and voice cloning

Created 6 months ago

358 stars

Top 77.9% on SourcePulse

Project Summary

Valtec Vietnamese TTS is an ultra-lightweight, CPU-only system for Vietnamese text-to-speech and zero-shot voice cloning, aimed at engineers and power users. It synthesizes and clones voices without a GPU, with reported CPU inference more than 4x faster than real time for zero-shot cloning.

How It Works

The system uses a small, CPU-native architecture (~74.8M parameters for the zero-shot model). Speaker and style encoders capture voice identity and prosody from a short reference clip (3-10 s), which also enables prosody transfer. The reported Real-Time Factor (RTF), meaning synthesis time divided by the duration of the audio produced, is below 0.3 on standard processors, so no GPU is needed for faster-than-real-time output.
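To make the RTF figure concrete, here is a minimal sketch of how it could be computed and measured. The function names are hypothetical, and `synthesize` stands in for any callable that returns audio samples and a sample rate; it is not the valtec-tts API.

```python
import time
from typing import Callable, Sequence, Tuple


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup_over_realtime(rtf: float) -> float:
    """An RTF below 1.0 means faster than real time; the speedup is 1 / RTF."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


def measure_rtf(
    synthesize: Callable[[str], Tuple[Sequence[float], int]], text: str
) -> float:
    """Time a synthesis callable that returns (samples, sample_rate).

    `synthesize` is a placeholder (assumption), not a real library function.
    """
    start = time.perf_counter()
    samples, sample_rate = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)
```

Under this definition, producing 10 s of audio in 2.36 s gives an RTF of 0.236, or roughly 4.2x faster than real time.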

Quick Start & Requirements

Install with pip:

pip install git+https://github.com/tronghieuit/valtec-tts.git

Requirements: Python 3.8+ and PyTorch 2.0+. CUDA is optional and only accelerates multi-speaker TTS; core functionality runs on CPU. Linux is recommended for optimal phonemization. Models download automatically from Hugging Face.
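Before installing, it can help to confirm that the interpreter and PyTorch meet the stated minimums. The following is a sketch using only the standard library; the helper names are illustrative and the check does not import or call valtec-tts itself.

```python
import re
import sys
from importlib import metadata
from typing import List, Optional, Tuple

MIN_PYTHON = (3, 8)  # from the stated requirements
MIN_TORCH = (2, 0)   # from the stated requirements


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as '2.1.0+cpu'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def installed_torch_version() -> Optional[str]:
    """Return the installed torch version, or None if torch is absent."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


def check_environment() -> List[str]:
    """Return a list of human-readable problems; empty means all good."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python 3.8 or newer is required")
    torch_version = installed_torch_version()
    if torch_version is None:
        problems.append("PyTorch is not installed")
    elif parse_major_minor(torch_version) < MIN_TORCH:
        problems.append(f"PyTorch 2.0+ is required, found {torch_version}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```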

Zero-Shot Voice Cloning Demo: https://huggingface.co/spaces/valtecAI-team/valtec-zeroshot-voice-cloning
Multi-Speaker TTS Demo: https://huggingface.co/spaces/valtecAI-team/valtec-vietnamese-tts

Highlighted Details

Ultra-lightweight: 74.8M parameters for zero-shot cloning (~285MB FP32).
CPU-only inference reaches an RTF as low as 0.236 for zero-shot tasks (1 / 0.236, or about 4.2x faster than real time).
Zero-shot voice cloning requires minimal reference audio (3-10 seconds) without fine-tuning.
Supports prosody transfer, replicating intonation, rhythm, and emotion from reference audio.
Features a dedicated Vietnamese phonemizer (Northern/Southern) and 5 built-in multi-speaker TTS voices.
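The ~285MB figure can be sanity-checked from the parameter count alone. The sketch below assumes 4 bytes per FP32 weight and counts weights only (no activations or runtime overhead); the function name is illustrative.

```python
def weight_footprint_mib(num_params: float, bytes_per_param: int = 4) -> float:
    """Approximate size of the raw weights in MiB (2**20 bytes).

    Assumes every parameter is stored at the same precision:
    4 bytes for FP32, 2 bytes for FP16.
    """
    return num_params * bytes_per_param / 2**20


if __name__ == "__main__":
    fp32 = weight_footprint_mib(74.8e6)
    print(f"FP32 weights: {fp32:.1f} MiB")
```

74.8M parameters at 4 bytes each is 299.2 million bytes, which is about 285 MiB, consistent with the stated size.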

Maintenance & Community

Developed by the ValtecAI Team. Specific community channels, notable contributors, sponsorships, or partnerships are not detailed in the README.

Licensing & Compatibility

Licensed under CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International). This license strictly prohibits commercial use without explicit written permission, limiting applications to non-commercial projects and research.

Limitations & Caveats

The model is optimized for Vietnamese; quality in other languages may be lower. Cloned-voice fidelity depends on the quality of the reference audio, and highly distinctive voices may not be replicated perfectly. The system is not yet optimized for real-time streaming.
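Because cloning quality depends on the reference clip, a simple pre-flight check on its length may be useful. This sketch uses Python's standard `wave` module and assumes an uncompressed WAV file; the 3-10 second bounds come from the summary above, and the function names are hypothetical.

```python
import wave

MIN_SECONDS = 3.0   # lower bound of the stated reference length
MAX_SECONDS = 10.0  # upper bound of the stated reference length


def wav_duration_seconds(path: str) -> float:
    """Duration of an uncompressed WAV file, from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(path: str) -> bool:
    """True when the clip length falls inside the 3-10 s window."""
    return MIN_SECONDS <= wav_duration_seconds(path) <= MAX_SECONDS
```

This only checks duration; it says nothing about noise, clipping, or other properties that also affect fidelity.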

Health Check

Last Commit: 3 months ago

Responsiveness: Inactive

Star History: 21 stars in the last 30 days