index-tts-lora by asr-pub

High-quality speech synthesis via LoRA fine-tuning

Created 1 month ago
252 stars

Top 99.6% on SourcePulse

Project Summary

This project offers LoRA fine-tuning solutions for the index-tts high-quality speech synthesis model. It enables users to enhance prosody and naturalness for both single and multi-speaker voice generation, targeting researchers and developers seeking customized TTS capabilities.

How It Works

This project implements LoRA (Low-Rank Adaptation) fine-tuning on top of the index-tts speech synthesis model. The core workflow involves preparing custom datasets by extracting audio tokens and speaker conditioning vectors using provided Python scripts. These processed features, alongside speaker metadata, are then fed into the training script (train.py) to adapt the model. LoRA enables efficient adaptation by training only a small number of additional parameters, significantly reducing computational cost and memory requirements compared to full model fine-tuning, while aiming to enhance prosody and naturalness.
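The parameter-efficiency claim above can be illustrated with a minimal sketch (not the project's actual implementation): a frozen pretrained weight matrix augmented with two small trainable low-rank factors, so only a tiny fraction of the layer's parameters receive gradients. The class name, shapes, and hyperparameters here are illustrative assumptions.

```python
import numpy as np

class LoRALinear:
    """Frozen weight matrix plus a trainable low-rank update (illustrative sketch)."""
    def __init__(self, in_features, out_features, rank=8, alpha=16.0):
        rng = np.random.default_rng(0)
        self.W = rng.standard_normal((out_features, in_features))  # frozen pretrained weight
        self.A = rng.standard_normal((rank, in_features)) * 0.01   # trainable low-rank factor
        self.B = np.zeros((out_features, rank))                    # trainable, zero-initialized
        self.scale = alpha / rank

    def forward(self, x):
        # W x + scale * B (A x); only A and B would be updated during fine-tuning
        return self.W @ x + self.scale * (self.B @ (self.A @ x))

layer = LoRALinear(512, 512, rank=8)
trainable = layer.A.size + layer.B.size
total = layer.W.size + trainable
print(f"trainable {trainable} of {total} parameters")
```

Because B is zero-initialized, the adapted layer initially reproduces the frozen model exactly; training then moves only A and B, which is why memory and compute costs stay far below full fine-tuning.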

Quick Start & Requirements

  • Primary commands:
    • Audio processing: python tools/extract_codec.py --audio_list ${audio_list} --extract_condition
    • Training: python train.py
    • Inference: python indextts/infer.py
  • Prerequisites: Audio data requires transcripts. Each line of audio_list pairs an audio path with its transcript (e.g., /path/to/audio.wav 小朋友们,大家好... — "Hello, children..."). GPU acceleration is recommended.
  • Dependencies: Relies on the index-tts library. Specific Python version or other non-default system requirements are not detailed.
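The audio_list format described above (one audio path followed by its transcript per line) could be parsed with a small helper like the following. The function name and file layout are assumptions for illustration, not part of the project's API.

```python
def parse_audio_list(path):
    """Parse lines of the form '<audio_path> <transcript>' (hypothetical helper)."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Split on the first space only: paths contain no spaces,
            # but transcripts may.
            audio_path, _, transcript = line.partition(" ")
            entries.append((audio_path, transcript))
    return entries
```

Usage would look like `entries = parse_audio_list("audio_list.txt")`, yielding (path, transcript) pairs ready for the token-extraction step.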

Highlighted Details

  • Demonstrates fine-tuning on a ~30-minute Chinese dataset ("Kai Shu Tells Stories") with 270 audio clips.
  • Supports synthesis of mixed-language (Chinese/English) content and tongue twisters.
  • References the "2025 Benchmark of Mainstream TTS Models" for evaluation context.

Limitations & Caveats

Transcripts used for fine-tuning were automatically generated by ASR and punctuation models and may contain errors that affect synthesis quality. The README does not specify supported operating systems or explicit hardware requirements beyond the implicit need for a GPU.

Health Check
Last Commit

1 month ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
2
Star History
48 stars in the last 30 days

Explore Similar Projects

Starred by Tim J. Baek (Founder of Open WebUI), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 4 more.

StyleTTS2 by yl4579

Top 0.3% on SourcePulse
6k stars
Text-to-speech model achieving human-level synthesis
Created 2 years ago
Updated 1 year ago