voicepowered-ai
Efficient LoRA finetuning for VibeVoice speech synthesis
Top 84.1% on SourcePulse
Summary
This repository offers an unofficial, work-in-progress (WIP) implementation of LoRA finetuning for VibeVoice models. It targets users who want to customize VibeVoice's text-to-speech output, and uses a dual-loss objective that covers both acoustic and semantic generation.
How It Works
The project uses LoRA for parameter-efficient finetuning of the VibeVoice 1.5B and 7B models. Training optimizes a dual-loss objective: masked cross-entropy (CE) on text tokens and a diffusion mean-squared-error (MSE) loss on acoustic latents. A custom collator constructs interleaved sequences containing speech placeholders and computes the masks used by the diffusion loss. Voice prompt handling is flexible: prompts can be auto-generated from the target audio or supplied by the user, and they are randomly dropped during training to improve generalization.
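The snippet below is a minimal sketch of how such a dual loss can be combined. The tensor names, the -100 ignore-index convention, and the loss weights are assumptions for illustration, not code taken from the repository.

```python
import torch
import torch.nn.functional as F

def dual_loss(text_logits, text_labels, pred_latents, target_latents,
              diffusion_mask, ce_weight=1.0, mse_weight=1.0):
    # Masked cross-entropy on text tokens: positions to ignore carry label -100.
    ce = F.cross_entropy(
        text_logits.view(-1, text_logits.size(-1)),
        text_labels.view(-1),
        ignore_index=-100,
    )
    # Diffusion MSE on acoustic latents, restricted to the speech positions
    # flagged by the collator's diffusion mask (shape [batch, seq_len]).
    mask = diffusion_mask.unsqueeze(-1).float()
    mse = ((pred_latents - target_latents) ** 2 * mask).sum() / mask.sum().clamp(min=1)
    return ce_weight * ce + mse_weight * mse
```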
Quick Start & Requirements
Installation involves cloning the repo, running pip install -e ., and installing transformers==4.51.3. A tested Docker image is runpod/pytorch:2.8.0-py3.11-cuda12.8.1-cudnn-devel-ubuntu22.04. Hardware requirements are substantial: 16GB VRAM for the 1.5B model and 48GB VRAM for the 7B model, with longer audio clips increasing VRAM needs further. Audio data should be 24kHz; other sample rates are resampled.
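Because training data is expected at 24kHz, it can help to resample clips ahead of time. The following is an illustrative preprocessing sketch using torchaudio, not a script from the repository; the paths and helper name are hypothetical.

```python
import torchaudio
import torchaudio.functional as AF

TARGET_SR = 24_000  # sample rate expected for VibeVoice training audio

def resample_to_24khz(in_path: str, out_path: str) -> None:
    # Load the clip and resample only if it is not already at the target rate.
    waveform, sr = torchaudio.load(in_path)
    if sr != TARGET_SR:
        waveform = AF.resample(waveform, orig_freq=sr, new_freq=TARGET_SR)
    torchaudio.save(out_path, waveform, TARGET_SR)

# Example with hypothetical paths:
# resample_to_24khz("clips/speaker1.wav", "clips_24k/speaker1.wav")
```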
Highlighted Details
Maintenance & Community
The README provides no specific details on maintainers, community channels, or a project roadmap; the listing marks the project as inactive, with its last update roughly three months ago.
Licensing & Compatibility
The README does not specify a software license, which makes it difficult to assess suitability for commercial use or closed-source integration.
Limitations & Caveats
The project is marked as an unofficial work in progress, so some instability should be expected. The strict pin on transformers==4.51.3 may complicate integration with other tooling. High VRAM requirements for the larger model and for longer audio present a hardware barrier, and the lack of an explicit license is a critical adoption blocker.