VieNeu-TTS by pnnbao97

Vietnamese TTS with instant voice cloning

Created 2 months ago
502 stars

Top 62.0% on SourcePulse

Summary

VieNeu-TTS is an on-device Vietnamese text-to-speech (TTS) system with instant voice cloning. It targets developers and users who need real-time, offline speech synthesis with consistent speaker identity and Vietnamese-English code-switching. Audio is generated directly on CPU or GPU, and the project describes its output as production-ready.

How It Works

The model pairs a Qwen 0.5B LLM backbone with the NeuCodec audio codec and processes inputs within a 2048-token context window. The architecture is optimized for real-time generation of 24kHz waveforms. Models are distributed in several formats: PyTorch for maximum quality, GGUF (Q4/Q8) variants optimized for fast CPU inference and streaming, and ONNX for the codec.
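As a rough illustration of what the 2048-token context and the 24kHz output rate imply, the sketch below estimates how much audio fits in a single generation. It rests on two assumptions that the summary does not state: that the prompt (text plus reference-voice codes) and the generated speech tokens share the same context window, and that the codec emits about 50 tokens per second of audio. The function name audio_budget and the default rate are illustrative, not part of the project's API.

```python
SAMPLE_RATE_HZ = 24_000   # output waveform rate stated in the summary
CONTEXT_TOKENS = 2048     # context window stated in the summary


def audio_budget(prompt_tokens, codec_tokens_per_second=50.0,
                 context_tokens=CONTEXT_TOKENS):
    """Estimate (seconds, samples) of audio that fit after the prompt.

    codec_tokens_per_second is an ASSUMED codec frame rate; replace it
    with the real value for the codec actually in use.
    """
    if codec_tokens_per_second <= 0:
        raise ValueError("codec_tokens_per_second must be positive")
    if not 0 <= prompt_tokens <= context_tokens:
        raise ValueError("prompt_tokens must lie within the context window")

    # Tokens left over for generated speech once the prompt is accounted for.
    speech_tokens = context_tokens - prompt_tokens
    seconds = speech_tokens / codec_tokens_per_second
    samples = round(seconds * SAMPLE_RATE_HZ)
    return seconds, samples


if __name__ == "__main__":
    seconds, samples = audio_budget(prompt_tokens=548)
    print(f"{seconds:.1f} s of audio, {samples} samples at {SAMPLE_RATE_HZ} Hz")
```

Under these assumptions, longer inputs would need to be split into chunks that each fit the window, which is why the prompt size matters as much as the text length.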

Quick Start & Requirements

Clone the repository and install dependencies with uv sync. Key requirements are Python 3.12+ and eSpeak NG for phonemization. Optional GPU acceleration requires llama-cpp-python built with CUDA support, and LMDeploy can be installed for faster GPU inference. Detailed setup guides, including a video tutorial, are available.

  • Repo: https://github.com/pnnbao97/VieNeu-TTS
  • Docs: https://huggingface.co/pnnbao-ump/VieNeu-TTS
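The prerequisites named in this section can be checked before running uv sync. The following is a minimal sketch, not part of the project: it assumes that uv and eSpeak NG are exposed on PATH as the executables uv and espeak-ng, and the helper name check_prerequisites is hypothetical.

```python
import shutil
import sys

MIN_PYTHON = (3, 12)                 # "Python 3.12+" from the requirements
REQUIRED_TOOLS = ("uv", "espeak-ng")  # ASSUMED executable names on PATH


def check_prerequisites(version_info=None, which=shutil.which):
    """Return a list of human-readable problems; empty if all checks pass.

    version_info and which are injectable so the check can be tested
    without touching the real interpreter or PATH.
    """
    version_info = sys.version_info if version_info is None else version_info
    problems = []

    # Compare only (major, minor) against the minimum supported version.
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, "
            f"found {version_info[0]}.{version_info[1]}"
        )

    # Each required command-line tool must resolve to an executable.
    for tool in REQUIRED_TOOLS:
        if which(tool) is None:
            problems.append(f"'{tool}' not found on PATH")

    return problems


if __name__ == "__main__":
    issues = check_prerequisites()
    for issue in issues:
        print("MISSING:", issue)
    if not issues:
        print("All prerequisites found; ready to run 'uv sync'.")
```

The check covers only the base requirements. The optional GPU pieces (llama-cpp-python with CUDA, LMDeploy) are Python packages and are better verified after installation.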

Highlighted Details

  • Instant voice cloning with high fidelity and speaker consistency.
  • Real-time 24kHz audio synthesis on CPU or GPU.
  • Multiple optimized model formats: PyTorch (best quality), GGUF Q4/Q8 (CPU optimized, streaming support), ONNX codec.
  • Supports Vietnamese and English code-switching.

Maintenance & Community

Developed by Phạm Nguyễn Ngọc Bảo, building upon NeuTTS Air. Community support is available via GitHub Issues and Hugging Face.

Licensing & Compatibility

Released under the permissive Apache License 2.0, suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

A Dockerized setup and fine-tuning code are planned but not yet released. GGUF models currently support only four specific reference voices, and streaming inference on GPU is not yet available.

Health Check

  • Last commit: 15 hours ago
  • Responsiveness: Inactive
  • Pull requests (30d): 36
  • Issues (30d): 9
  • Star history: 222 stars in the last 30 days
