VyvoTTS by Vyvo-Labs

Text-to-Speech training and inference framework powered by Large Language Models

Created 8 months ago
254 stars

Top 99.1% on SourcePulse

Project Summary

Summary

VyvoTTS is an LLM-based framework for Text-to-Speech (TTS) training and inference, designed for researchers and power users. It offers a comprehensive suite of tools for creating and deploying custom TTS models, from training LLMs from scratch to efficient voice cloning and optimized inference, significantly streamlining the TTS development pipeline.

How It Works

This framework leverages Large Language Models (LLMs) for advanced TTS capabilities. It supports full pre-training of LLM models on custom datasets, fine-tuning for specific TTS tasks, and memory-efficient adaptation using Low-Rank Adaptation (LoRA). Novel neural techniques are employed for voice cloning. A unified tokenizer simplifies dataset preparation for both Qwen3 and LFM2 model architectures, facilitating flexible data handling and model compatibility.
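To make the unified-tokenizer idea concrete, here is a minimal toy sketch (not the actual VyvoTTS API; all names, vocabulary sizes, and the character-level "tokenizer" are illustrative assumptions) of how text tokens and discrete audio codec codes can share a single id space, so one LLM vocabulary covers both:

```python
# Toy sketch (hypothetical, not VyvoTTS code): a unified token space for
# text and neural-codec audio codes, as used in LLM-based TTS pipelines.

TEXT_VOCAB_SIZE = 32_000    # assumed size of the base LLM text vocabulary
AUDIO_CODEBOOK_SIZE = 1024  # assumed size of the audio codec codebook

def text_to_ids(text: str) -> list[int]:
    # Stand-in for a real subword tokenizer: one id per character.
    return [ord(c) % TEXT_VOCAB_SIZE for c in text]

def audio_to_ids(codes: list[int]) -> list[int]:
    # Offset audio codes past the text vocabulary so both kinds of token
    # live in one shared id space.
    return [TEXT_VOCAB_SIZE + c for c in codes]

def build_sequence(text: str, codes: list[int]) -> list[int]:
    # Text tokens followed by audio tokens: the LLM is trained to continue
    # a text prefix with audio codes, which is what inference exploits.
    return text_to_ids(text) + audio_to_ids(codes)

seq = build_sequence("hi", [0, 5, 1023])  # [104, 105, 32000, 32005, 33023]
```

The benefit of a shared id space is that a single tokenizer and embedding table can serve multiple model architectures (here, Qwen3 and LFM2) without per-model dataset conversion.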

Quick Start & Requirements

Installation involves setting up a Python 3.10 virtual environment with uv and installing dependencies via uv pip install -r requirements.txt. For lower-end GPUs (6GB+ VRAM), a Jupyter notebook (notebook/vyvotts-lfm2-train.ipynb) is available, requiring uv pip install jupyter notebook. Fine-tuning requires a minimum of 30GB VRAM.
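Put together, the setup described above looks roughly like this (commands and the notebook path are taken from the summary; assumes uv is already installed):

```shell
# Create a Python 3.10 virtual environment with uv and install dependencies
uv venv --python 3.10
source .venv/bin/activate
uv pip install -r requirements.txt

# Optional: Jupyter for the low-VRAM (6GB+) training notebook
uv pip install jupyter notebook
jupyter notebook notebook/vyvotts-lfm2-train.ipynb
```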

Highlighted Details

  • Versatile Training: Supports pre-training from scratch, fine-tuning, and LoRA adaptation for efficient model customization.
  • Advanced Voice Cloning: Implements sophisticated neural techniques for high-fidelity voice cloning.
  • Multiple Inference Backends: Offers optimized inference through standard Transformers, memory-efficient Unsloth (4-bit/8-bit), high-quality HQQ (4-bit), and production-ready vLLM for maximum throughput.
  • Unified Tokenization: A single tokenizer handles both Qwen3 and LFM2 models, simplifying dataset processing.
  • Distributed Training: Features multi-GPU support via the accelerate library for scalable training.
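Multi-GPU training with the accelerate library typically takes the shape below; note that the training script name here is a placeholder assumption, not a confirmed path in this repository:

```shell
# One-time interactive setup: GPU count, mixed precision, etc.
accelerate config

# Launch the same training script across all visible GPUs
# (train.py is a hypothetical entry point for illustration)
accelerate launch --multi_gpu train.py
```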

Maintenance & Community

The provided README does not detail specific contributors, sponsorships, or community channels like Discord or Slack.

Licensing & Compatibility

The project is licensed under the permissive MIT License, which generally allows for commercial use and integration into closed-source projects without significant restrictions.

Limitations & Caveats

Fine-tuning operations demand substantial GPU resources, with a minimum requirement of 30GB VRAM. While options exist for lower-end GPUs, full-scale training and fine-tuning remain resource-intensive. The roadmap indicates ongoing development, suggesting some features may still be experimental or under active implementation.

Health Check
Last Commit

3 days ago

Responsiveness

Inactive

Pull Requests (30d)
3
Issues (30d)
0
Star History
5 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Pietro Schirano (founder of MagicPath), and 2 more.

metavoice-src by metavoiceio

0.0%
4k
TTS model for human-like, expressive speech
Created 2 years ago
Updated 1 year ago
Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm) and Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems").

GPT-SoVITS by RVC-Boss

0.4%
57k
Few-shot voice cloning and TTS web UI
Created 2 years ago
Updated 2 months ago