VyvoTTS by Vyvo-Labs

Text-to-Speech training and inference framework powered by Large Language Models

Created 8 months ago
254 stars

Top 99.1% on SourcePulse

Project Summary

Summary

VyvoTTS is an LLM-based framework for Text-to-Speech (TTS) training and inference, designed for researchers and power users. It offers a comprehensive suite of tools for creating and deploying custom TTS models, from training LLMs from scratch to efficient voice cloning and optimized inference, significantly streamlining the TTS development pipeline.

How It Works

This framework leverages Large Language Models (LLMs) for advanced TTS capabilities. It supports full pre-training of LLM models on custom datasets, fine-tuning for specific TTS tasks, and memory-efficient adaptation using Low-Rank Adaptation (LoRA). Novel neural techniques are employed for voice cloning. A unified tokenizer simplifies dataset preparation for both Qwen3 and LFM2 model architectures, facilitating flexible data handling and model compatibility.
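To make the unified-tokenizer idea concrete, here is a minimal toy sketch (not the actual VyvoTTS API; all names, vocabulary sizes, and the character-level "tokenizer" are illustrative assumptions) of how text tokens and discrete audio codec codes can share a single id space, so one LLM vocabulary covers both:

```python
# Toy sketch (hypothetical, not VyvoTTS code): a unified token space for
# text and neural-codec audio codes, as used in LLM-based TTS pipelines.

TEXT_VOCAB_SIZE = 32_000    # assumed size of the base LLM text vocabulary
AUDIO_CODEBOOK_SIZE = 1024  # assumed size of the audio codec codebook

def text_to_ids(text: str) -> list[int]:
    # Stand-in for a real subword tokenizer: one id per character.
    return [ord(c) % TEXT_VOCAB_SIZE for c in text]

def audio_to_ids(codes: list[int]) -> list[int]:
    # Offset audio codes past the text vocabulary so both kinds of token
    # live in one shared id space.
    return [TEXT_VOCAB_SIZE + c for c in codes]

def build_sequence(text: str, codes: list[int]) -> list[int]:
    # Text tokens followed by audio tokens: the LLM is trained to continue
    # a text prefix with audio codes, which is what inference exploits.
    return text_to_ids(text) + audio_to_ids(codes)

seq = build_sequence("hi", [0, 5, 1023])  # [104, 105, 32000, 32005, 33023]
```

The benefit of a shared id space is that a single tokenizer and embedding table can serve multiple model architectures (here, Qwen3 and LFM2) without per-model dataset conversion.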

Quick Start & Requirements

Installation involves setting up a Python 3.10 virtual environment with uv and installing dependencies via uv pip install -r requirements.txt. For lower-end GPUs (6GB+ VRAM), a Jupyter notebook (notebook/vyvotts-lfm2-train.ipynb) is available, requiring uv pip install jupyter notebook. Fine-tuning requires a minimum of 30GB VRAM.
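Put together, the setup described above looks roughly like this (commands and the notebook path are taken from the summary; assumes uv is already installed):

```shell
# Create a Python 3.10 virtual environment with uv and install dependencies
uv venv --python 3.10
source .venv/bin/activate
uv pip install -r requirements.txt

# Optional: Jupyter for the low-VRAM (6GB+) training notebook
uv pip install jupyter notebook
jupyter notebook notebook/vyvotts-lfm2-train.ipynb
```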

Highlighted Details

  • Versatile Training: Supports pre-training from scratch, fine-tuning, and LoRA adaptation for efficient model customization.
  • Advanced Voice Cloning: Implements sophisticated neural techniques for high-fidelity voice cloning.
  • Multiple Inference Backends: Offers optimized inference through standard Transformers, memory-efficient Unsloth (4-bit/8-bit), high-quality HQQ (4-bit), and production-ready vLLM for maximum throughput.
  • Unified Tokenization: A single tokenizer handles both Qwen3 and LFM2 models, simplifying dataset processing.
  • Distributed Training: Features multi-GPU support via the accelerate library for scalable training.
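Multi-GPU training with the accelerate library typically takes the shape below; note that the training script name here is a placeholder assumption, not a confirmed path in this repository:

```shell
# One-time interactive setup: GPU count, mixed precision, etc.
accelerate config

# Launch the same training script across all visible GPUs
# (train.py is a hypothetical entry point for illustration)
accelerate launch --multi_gpu train.py
```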

Maintenance & Community

The provided README does not detail specific contributors, sponsorships, or community channels like Discord or Slack.

Licensing & Compatibility

The project is licensed under the permissive MIT License, which generally allows for commercial use and integration into closed-source projects without significant restrictions.

Limitations & Caveats

Fine-tuning operations demand substantial GPU resources, with a minimum requirement of 30GB VRAM. While options exist for lower-end GPUs, full-scale training and fine-tuning remain resource-intensive. The roadmap indicates ongoing development, suggesting some features may still be experimental or under active implementation.

Health Check
Last Commit

3 days ago

Responsiveness

Inactive

Pull Requests (30d)
3
Issues (30d)
0
Star History
5 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Pietro Schirano (founder of MagicPath), and 2 more.

metavoice-src by metavoiceio

0.0%
4k
TTS model for human-like, expressive speech
Created 2 years ago
Updated 1 year ago
Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm) and Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems").

GPT-SoVITS by RVC-Boss

0.4%
57k
Few-shot voice cloning and TTS web UI
Created 2 years ago
Updated 2 months ago