index-tts-lora by asr-pub

High-quality speech synthesis via LoRA fine-tuning

Created 1 month ago
252 stars

Top 99.6% on SourcePulse

Project Summary

This project offers LoRA fine-tuning solutions for the index-tts high-quality speech synthesis model. It enables users to enhance prosody and naturalness for both single and multi-speaker voice generation, targeting researchers and developers seeking customized TTS capabilities.

How It Works

This project implements LoRA (Low-Rank Adaptation) fine-tuning on top of the index-tts speech synthesis model. The core workflow involves preparing custom datasets by extracting audio tokens and speaker conditioning vectors using provided Python scripts. These processed features, alongside speaker metadata, are then fed into the training script (train.py) to adapt the model. LoRA enables efficient adaptation by training only a small number of additional parameters, significantly reducing computational cost and memory requirements compared to full model fine-tuning, while aiming to enhance prosody and naturalness.
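The parameter-efficiency claim above can be illustrated with a minimal sketch (not the project's actual implementation): a frozen pretrained weight matrix augmented with two small trainable low-rank factors, so only a tiny fraction of the layer's parameters receive gradients. The class name, shapes, and hyperparameters here are illustrative assumptions.

```python
import numpy as np

class LoRALinear:
    """Frozen weight matrix plus a trainable low-rank update (illustrative sketch)."""
    def __init__(self, in_features, out_features, rank=8, alpha=16.0):
        rng = np.random.default_rng(0)
        self.W = rng.standard_normal((out_features, in_features))  # frozen pretrained weight
        self.A = rng.standard_normal((rank, in_features)) * 0.01   # trainable low-rank factor
        self.B = np.zeros((out_features, rank))                    # trainable, zero-initialized
        self.scale = alpha / rank

    def forward(self, x):
        # W x + scale * B (A x); only A and B would be updated during fine-tuning
        return self.W @ x + self.scale * (self.B @ (self.A @ x))

layer = LoRALinear(512, 512, rank=8)
trainable = layer.A.size + layer.B.size
total = layer.W.size + trainable
print(f"trainable {trainable} of {total} parameters")
```

Because B is zero-initialized, the adapted layer initially reproduces the frozen model exactly; training then moves only A and B, which is why memory and compute costs stay far below full fine-tuning.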

Quick Start & Requirements

  • Primary commands:
    • Audio processing: python tools/extract_codec.py --audio_list ${audio_list} --extract_condition
    • Training: python train.py
    • Inference: python indextts/infer.py
  • Prerequisites: Audio data requires transcripts. Each line of audio_list pairs an audio path with its transcript (e.g., /path/to/audio.wav 小朋友们,大家好... — "Hello, children..."). GPU acceleration is recommended.
  • Dependencies: Relies on the index-tts library. Specific Python version or other non-default system requirements are not detailed.
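The audio_list format described above (one audio path followed by its transcript per line) could be parsed with a small helper like the following. The function name and file layout are assumptions for illustration, not part of the project's API.

```python
def parse_audio_list(path):
    """Parse lines of the form '<audio_path> <transcript>' (hypothetical helper)."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Split on the first space only: paths contain no spaces,
            # but transcripts may.
            audio_path, _, transcript = line.partition(" ")
            entries.append((audio_path, transcript))
    return entries
```

Usage would look like `entries = parse_audio_list("audio_list.txt")`, yielding (path, transcript) pairs ready for the token-extraction step.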

Highlighted Details

  • Demonstrates fine-tuning on a ~30-minute Chinese dataset ("Kai Shu Tells Stories") with 270 audio clips.
  • Supports synthesis of mixed-language (Chinese/English) content and tongue twisters.
  • References the "2025 Benchmark of Mainstream TTS Models" for evaluation context.

Limitations & Caveats

Transcripts used for fine-tuning were automatically generated by ASR and punctuation models and may contain errors that affect synthesis quality. The README does not specify supported operating systems or explicit hardware requirements beyond the implicit need for a GPU.

Health Check
Last Commit

1 month ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
2
Star History
48 stars in the last 30 days

Explore Similar Projects

Starred by Tim J. Baek (Founder of Open WebUI), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 4 more.

StyleTTS2 by yl4579

Top 0.3% on SourcePulse
6k stars
Text-to-speech model achieving human-level synthesis
Created 2 years ago
Updated 1 year ago