VibeVoice-finetuning by voicepowered-ai

Efficient LoRA finetuning for VibeVoice speech synthesis

Created 3 months ago
324 stars

Top 84.1% on SourcePulse

View on GitHub
Project Summary

Summary

This repository offers an unofficial, work-in-progress (WIP) implementation for LoRA finetuning of VibeVoice models. It targets users aiming to customize VibeVoice's text-to-speech capabilities, leveraging a dual-loss approach for enhanced acoustic and semantic generation.

How It Works

The project utilizes LoRA for parameter-efficient finetuning of VibeVoice 1.5B and 7B models. Training employs a dual-loss objective: masked Cross-Entropy (CE) on text tokens and diffusion Mean Squared Error (MSE) on acoustic latents. A custom collator constructs interleaved sequences with speech placeholders and computes masks for diffusion loss. The approach supports flexible voice prompt handling, including auto-generation from target audio or user-provided files, with random dropping during training for generalization.
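The dual-loss objective described above can be sketched as a small PyTorch function. This is a minimal illustration under assumed tensor shapes, not the repository's actual training code; the loss weights and masking convention are assumptions.

```python
import torch
import torch.nn.functional as F

def dual_loss(text_logits, text_targets, text_mask,
              pred_latents, target_latents, diffusion_mask,
              ce_weight=1.0, mse_weight=1.0):
    """Masked CE on text-token positions plus MSE on acoustic-latent
    positions. Assumed shapes: text_logits (B, T, V), text_targets (B, T),
    pred/target_latents (B, T, D), both masks (B, T) bool."""
    # cross-entropy only where the sequence holds text tokens
    ce = F.cross_entropy(text_logits[text_mask], text_targets[text_mask])
    # diffusion MSE only where the sequence holds acoustic latents
    mse = F.mse_loss(pred_latents[diffusion_mask],
                     target_latents[diffusion_mask])
    return ce_weight * ce + mse_weight * mse
```

Because the two masks are disjoint, each position contributes to exactly one of the two terms.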

Quick Start & Requirements

Installation requires cloning the repo, running pip install -e ., and installing transformers==4.51.3. A tested Docker image is runpod/pytorch:2.8.0-py3.11-cuda12.8.1-cudnn-devel-ubuntu22.04. Hardware demands are high: 16 GB VRAM for 1.5B models and 48 GB VRAM for 7B models, with longer audio clips increasing VRAM needs. Audio data should be 24 kHz; other sample rates are resampled.
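The 24 kHz requirement can be checked (and naively satisfied) with a short helper. This linear-interpolation version is only a sketch; a real pipeline would use a band-limited resampler such as torchaudio or librosa, and the function name here is hypothetical.

```python
import numpy as np

def to_24k(audio: np.ndarray, sr: int, target_sr: int = 24_000) -> np.ndarray:
    """Resample a mono waveform to 24 kHz via linear interpolation.
    Illustrative only; use torchaudio/librosa for production-quality
    band-limited resampling."""
    if sr == target_sr:
        return audio
    n_out = int(round(audio.shape[0] * target_sr / sr))
    src_t = np.arange(audio.shape[0]) / sr          # source sample times (s)
    dst_t = np.arange(n_out) / target_sr            # target sample times (s)
    return np.interp(dst_t, src_t, audio)
```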

Highlighted Details

  • Supports LoRA finetuning for VibeVoice 1.5B and 7B models.
  • Dual-loss strategy combines text CE loss and acoustic diffusion MSE loss.
  • Flexible voice prompt conditioning: auto-generated from target audio or user-provided, with random dropping during training (a 0.2 drop rate was tested).
  • Collator builds interleaved sequences and computes diffusion loss masks.
  • LoRA applied to LLM (Qwen) and optionally diffusion head.
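The collator and prompt-dropping behavior above can be sketched roughly as follows. This is a simplified single-segment version with hypothetical field names, placeholder id, and padding scheme; the real collator interleaves multiple turns and its identifiers will differ.

```python
import torch

PLACEHOLDER_ID = -1  # hypothetical token id marking speech-latent slots
PAD_ID = 0           # hypothetical padding id

def collate(examples, prompt_drop_rate=0.2, generator=None):
    """examples: dicts with 'text_ids' (1-D LongTensor), 'num_latents' (int),
    and optional 'prompt_ids'. Returns padded ids plus boolean masks for the
    text CE loss and the diffusion MSE loss."""
    rows, ce_masks, diff_masks = [], [], []
    for ex in examples:
        text = ex["text_ids"]
        speech = torch.full((ex["num_latents"],), PLACEHOLDER_ID, dtype=torch.long)
        ids = torch.cat([text, speech])  # text tokens followed by speech slots
        ce = torch.cat([torch.ones_like(text), torch.zeros_like(speech)]).bool()
        diff = ~ce
        # randomly drop the voice prompt for generalization (0.2 rate tested)
        if ex.get("prompt_ids") is not None and torch.rand((), generator=generator) >= prompt_drop_rate:
            prompt = ex["prompt_ids"]
            ids = torch.cat([prompt, ids])
            keep_out = torch.zeros_like(prompt, dtype=torch.bool)  # no loss on prompt
            ce = torch.cat([keep_out, ce])
            diff = torch.cat([keep_out, diff])
        rows.append(ids); ce_masks.append(ce); diff_masks.append(diff)
    L = max(r.numel() for r in rows)
    pad = lambda t, v: torch.nn.functional.pad(t, (0, L - t.numel()), value=v)
    return {
        "input_ids": torch.stack([pad(r, PAD_ID) for r in rows]),
        "ce_mask": torch.stack([pad(m.long(), 0).bool() for m in ce_masks]),
        "diffusion_mask": torch.stack([pad(m.long(), 0).bool() for m in diff_masks]),
    }
```

Padding positions fall outside both masks, so they contribute to neither loss term.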

Maintenance & Community

No specific details on maintainers, community channels, or project roadmap are provided in the README.

Licensing & Compatibility

The README does not specify a software license, which hinders assessment of suitability for commercial use or closed-source integration.

Limitations & Caveats

Marked as "Unofficial WIP," indicating potential instability. A strict dependency on transformers==4.51.3 may complicate integration. High VRAM requirements for larger models and extended audio present a hardware barrier. The lack of explicit licensing is a critical adoption blocker.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 42 stars in the last 30 days
