VieNeu-TTS by pnnbao97

Vietnamese TTS with instant voice cloning

Created 8 months ago

2,100 stars

Top 20.5% on SourcePulse

Project Summary

Summary

VieNeu-TTS is an on-device Vietnamese Text-to-Speech (TTS) system with instant voice cloning. It targets developers and users who need high-quality, real-time, offline speech synthesis with speaker consistency and Vietnamese-English code-switching. The system generates production-ready audio directly on CPU or GPU, with no cloud service required.

How It Works

The model pairs a Qwen 0.5B LLM backbone with the NeuCodec audio codec and processes inputs within a 2048-token context. The architecture is optimized for real-time 24kHz waveform generation. Models are distributed in several formats: PyTorch for maximum quality, GGUF (Q4/Q8) variants optimized for fast CPU inference and streaming, and an ONNX build of the codec for compatibility.
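
To get a feel for what a 2048-token context means for audio length, the arithmetic can be sketched as below. This is only a rough estimate under stated assumptions: the codec token rate of 50 tokens per second is not given in this summary and is a placeholder to be replaced with the real NeuCodec figure, and the sketch assumes that prompt tokens (text plus any reference-voice tokens) and generated audio tokens share the same context window. The function names are hypothetical.

```python
SAMPLE_RATE_HZ = 24_000        # from the text: 24kHz output
CONTEXT_TOKENS = 2048          # from the text: 2048-token context
CODEC_TOKENS_PER_SECOND = 50   # ASSUMPTION: not stated in this summary


def max_audio_seconds(prompt_tokens,
                      context_tokens=CONTEXT_TOKENS,
                      codec_rate=CODEC_TOKENS_PER_SECOND):
    """Upper bound on audio length if the prompt and the generated
    audio tokens must fit in one shared context window (assumption)."""
    if not 0 <= prompt_tokens <= context_tokens:
        raise ValueError("prompt_tokens must be between 0 and context_tokens")
    return (context_tokens - prompt_tokens) / codec_rate


def samples_for(seconds, sample_rate=SAMPLE_RATE_HZ):
    """Number of PCM samples in `seconds` of audio at the given rate."""
    return round(seconds * sample_rate)


if __name__ == "__main__":
    seconds = max_audio_seconds(prompt_tokens=48)
    print(f"about {seconds:.1f} s of audio, {samples_for(seconds)} samples")
```

Under these assumptions, longer inputs would have to be split into chunks that each fit the budget; the actual chunking strategy used by the project is not described here.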

Quick Start & Requirements

Clone the repository and install dependencies using uv sync. Key requirements include Python 3.12+ and eSpeak NG for phonemization. Optional GPU acceleration requires llama-cpp-python with CUDA support, and LMDeploy optimizations can be installed for enhanced GPU performance. Detailed setup guides, including a video tutorial, are available.

Repo: https://github.com/pnnbao97/VieNeu-TTS
Docs: https://huggingface.co/pnnbao-ump/VieNeu-TTS
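
Before running uv sync, it can help to confirm that the listed prerequisites are present. The following is a minimal sketch, not part of the project: it checks the Python version against 3.12 and looks for the uv and espeak-ng executables on PATH. The executable names are assumptions about a typical install, and the helper names are hypothetical.

```python
import shutil
import sys

MIN_PYTHON = (3, 12)                   # from the requirements above
REQUIRED_TOOLS = ("uv", "espeak-ng")   # ASSUMPTION: executable names on PATH


def python_ok(version=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= minimum


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    if not python_ok():
        found = sys.version.split()[0]
        print(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required, found {found}")
    for name in missing_tools():
        print(f"Missing executable: {name}")
```

The script prints nothing when every check passes. The optional GPU components (llama-cpp-python with CUDA support, LMDeploy) are left out because their presence depends on the chosen setup.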

Highlighted Details

Instant voice cloning with high fidelity and speaker consistency.
Real-time 24kHz audio synthesis on CPU or GPU.
Multiple optimized model formats: PyTorch (best quality), GGUF Q4/Q8 (CPU optimized, streaming support), ONNX codec.
Supports Vietnamese and English code-switching.

Maintenance & Community

Developed by Phạm Nguyễn Ngọc Bảo, building upon NeuTTS Air. Community support is available via GitHub Issues and Hugging Face.

Licensing & Compatibility

Released under the permissive Apache License 2.0, suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

A Dockerized setup and fine-tuning code are planned but not yet released. GGUF models currently support only four specific reference voices. Streaming inference on GPU is also a future development goal.

Health Check

Last Commit

22 hours ago

Responsiveness

Inactive

Star History

383 stars in the last 30 days