VibeVoice-finetuning by voicepowered-ai

Efficient LoRA finetuning for VibeVoice speech synthesis

Created 3 months ago
324 stars

Top 84.1% on SourcePulse

View on GitHub
Project Summary

Summary

This repository offers an unofficial, work-in-progress (WIP) implementation for LoRA finetuning of VibeVoice models. It targets users aiming to customize VibeVoice's text-to-speech capabilities, leveraging a dual-loss approach for enhanced acoustic and semantic generation.

How It Works

The project utilizes LoRA for parameter-efficient finetuning of VibeVoice 1.5B and 7B models. Training employs a dual-loss objective: masked Cross-Entropy (CE) on text tokens and diffusion Mean Squared Error (MSE) on acoustic latents. A custom collator constructs interleaved sequences with speech placeholders and computes masks for diffusion loss. The approach supports flexible voice prompt handling, including auto-generation from target audio or user-provided files, with random dropping during training for generalization.
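The dual-loss objective described above can be sketched as a small PyTorch function. This is a minimal illustration under assumed tensor shapes, not the repository's actual training code; the loss weights and masking convention are assumptions.

```python
import torch
import torch.nn.functional as F

def dual_loss(text_logits, text_targets, text_mask,
              pred_latents, target_latents, diffusion_mask,
              ce_weight=1.0, mse_weight=1.0):
    """Masked CE on text-token positions plus MSE on acoustic-latent
    positions. Assumed shapes: text_logits (B, T, V), text_targets (B, T),
    pred/target_latents (B, T, D), both masks (B, T) bool."""
    # cross-entropy only where the sequence holds text tokens
    ce = F.cross_entropy(text_logits[text_mask], text_targets[text_mask])
    # diffusion MSE only where the sequence holds acoustic latents
    mse = F.mse_loss(pred_latents[diffusion_mask],
                     target_latents[diffusion_mask])
    return ce_weight * ce + mse_weight * mse
```

Because the two masks are disjoint, each position contributes to exactly one of the two terms.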

Quick Start & Requirements

Installation requires cloning the repo, running pip install -e ., and installing transformers==4.51.3. A tested Docker image is runpod/pytorch:2.8.0-py3.11-cuda12.8.1-cudnn-devel-ubuntu22.04. Hardware demands are high: 16 GB VRAM for 1.5B models and 48 GB VRAM for 7B models, with longer audio clips increasing VRAM needs. Audio data should be 24 kHz; other sample rates are resampled.
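The 24 kHz requirement can be checked (and naively satisfied) with a short helper. This linear-interpolation version is only a sketch; a real pipeline would use a band-limited resampler such as torchaudio or librosa, and the function name here is hypothetical.

```python
import numpy as np

def to_24k(audio: np.ndarray, sr: int, target_sr: int = 24_000) -> np.ndarray:
    """Resample a mono waveform to 24 kHz via linear interpolation.
    Illustrative only; use torchaudio/librosa for production-quality
    band-limited resampling."""
    if sr == target_sr:
        return audio
    n_out = int(round(audio.shape[0] * target_sr / sr))
    src_t = np.arange(audio.shape[0]) / sr          # source sample times (s)
    dst_t = np.arange(n_out) / target_sr            # target sample times (s)
    return np.interp(dst_t, src_t, audio)
```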

Highlighted Details

  • Supports LoRA finetuning for VibeVoice 1.5B and 7B models.
  • Dual-loss strategy combines text CE loss and acoustic diffusion MSE loss.
  • Flexible voice prompt conditioning: auto-generated from target audio or user-provided, with random dropping during training (a 0.2 drop rate was tested).
  • Collator builds interleaved sequences and computes diffusion loss masks.
  • LoRA applied to LLM (Qwen) and optionally diffusion head.
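The collator and prompt-dropping behavior above can be sketched roughly as follows. This is a simplified single-segment version with hypothetical field names, placeholder id, and padding scheme; the real collator interleaves multiple turns and its identifiers will differ.

```python
import torch

PLACEHOLDER_ID = -1  # hypothetical token id marking speech-latent slots
PAD_ID = 0           # hypothetical padding id

def collate(examples, prompt_drop_rate=0.2, generator=None):
    """examples: dicts with 'text_ids' (1-D LongTensor), 'num_latents' (int),
    and optional 'prompt_ids'. Returns padded ids plus boolean masks for the
    text CE loss and the diffusion MSE loss."""
    rows, ce_masks, diff_masks = [], [], []
    for ex in examples:
        text = ex["text_ids"]
        speech = torch.full((ex["num_latents"],), PLACEHOLDER_ID, dtype=torch.long)
        ids = torch.cat([text, speech])  # text tokens followed by speech slots
        ce = torch.cat([torch.ones_like(text), torch.zeros_like(speech)]).bool()
        diff = ~ce
        # randomly drop the voice prompt for generalization (0.2 rate tested)
        if ex.get("prompt_ids") is not None and torch.rand((), generator=generator) >= prompt_drop_rate:
            prompt = ex["prompt_ids"]
            ids = torch.cat([prompt, ids])
            keep_out = torch.zeros_like(prompt, dtype=torch.bool)  # no loss on prompt
            ce = torch.cat([keep_out, ce])
            diff = torch.cat([keep_out, diff])
        rows.append(ids); ce_masks.append(ce); diff_masks.append(diff)
    L = max(r.numel() for r in rows)
    pad = lambda t, v: torch.nn.functional.pad(t, (0, L - t.numel()), value=v)
    return {
        "input_ids": torch.stack([pad(r, PAD_ID) for r in rows]),
        "ce_mask": torch.stack([pad(m.long(), 0).bool() for m in ce_masks]),
        "diffusion_mask": torch.stack([pad(m.long(), 0).bool() for m in diff_masks]),
    }
```

Padding positions fall outside both masks, so they contribute to neither loss term.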

Maintenance & Community

No specific details on maintainers, community channels, or project roadmap are provided in the README.

Licensing & Compatibility

The README does not specify a software license, which hinders assessment of suitability for commercial use or closed-source integration.

Limitations & Caveats

Marked as "Unofficial WIP," indicating potential instability. A strict dependency on transformers==4.51.3 may complicate integration. High VRAM requirements for larger models and extended audio present a hardware barrier. The lack of explicit licensing is a critical adoption blocker.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 42 stars in the last 30 days
