VibeVoice by vibevoice-community

Long-form conversational TTS synthesis

Created 2 weeks ago


397 stars

Top 72.7% on SourcePulse

View on GitHub
Project Summary

VibeVoice is a novel framework for generating expressive, long-form, multi-speaker conversational audio from text. It addresses the scalability and speaker-consistency challenges of traditional TTS, producing up to 90 minutes of high-fidelity audio with multiple speakers through ultra-low-frame-rate tokenization and a diffusion model.

How It Works

The core innovation is a pair of continuous speech tokenizers (acoustic and semantic) operating at 7.5 Hz, which preserve fidelity while greatly reducing sequence length for long-form generation. A next-token diffusion framework combines an LLM, which models dialogue context, with a diffusion head that renders fine acoustic detail, enabling synthesis of up to 90 minutes of audio with up to 4 speakers.
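As a back-of-envelope check on why the 7.5 Hz rate matters for long sequences, the token budget for the maximum clip length can be computed directly (a sketch; the 50 Hz comparison rate is a hypothetical illustration, not a figure from the project):

```shell
# Acoustic tokens for a 90-minute clip at 7.5 tokens per second:
awk 'BEGIN { printf "%d tokens at 7.5 Hz\n", 90 * 60 * 7.5 }'
# A hypothetical 50 Hz tokenizer would need several times more:
awk 'BEGIN { printf "%d tokens at 50 Hz\n", 90 * 60 * 50 }'
```

At 7.5 Hz, even a 90-minute dialogue stays around 40k acoustic tokens, a sequence length a context-modeling LLM can handle.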

Quick Start & Requirements

Install either via Docker with an NVIDIA PyTorch container (e.g., nvcr.io/nvidia/pytorch:24.07-py3), which may require a manual flash-attention build, or from source by cloning the repo (git clone https://github.com/vibevoice-community/VibeVoice.git) and running pip install -e . An NVIDIA GPU with CUDA is required. A Colab notebook is available for the 1.5B model.
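The two installation paths above can be sketched as shell commands (the container tag and repo URL are taken from the text; flash-attention setup details vary by environment):

```shell
# Option A: NVIDIA PyTorch container (flash-attention may need a manual build inside)
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:24.07-py3

# Option B: install from source into the current Python environment
git clone https://github.com/vibevoice-community/VibeVoice.git
cd VibeVoice
pip install -e .
```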

Highlighted Details

  • Long-Form & Multi-Speaker: Synthesizes up to 90 minutes of audio with support for 4 distinct speakers.
  • Efficient Tokenization: Utilizes 7.5 Hz continuous speech tokenizers for enhanced computational efficiency.
  • Community Fork: Preserves the codebase and adds features following the official repository's removal.
  • Open-Sourced Weights: VibeVoice-7B model weights are available.

Maintenance & Community

This is a community-maintained fork established after the official repository's removal. A Discord server (https://discord.gg/ZDEYTTRxWG) facilitates community interaction. Upcoming features include unofficial training code.

Licensing & Compatibility

The specific license is not detailed. The project is intended solely for research and development, with explicit advisories against commercial or real-world application use without further testing.

Limitations & Caveats

  • Chinese speech can be unstable; English punctuation and the Large model are recommended.
  • Background music generation is an emergent, spontaneous behavior; singing is likewise emergent and may be off-key.
  • Cross-lingual transfer can be unstable. Support is limited to English and Chinese; other languages may yield unexpected outputs.
  • Overlapping speech is not supported.
  • The model inherits biases from its base LLM (Qwen2.5-1.5B).
  • High potential for misuse in deepfakes and disinformation.

Health Check
Last Commit

22 hours ago

Responsiveness

Inactive

Pull Requests (30d)
4
Issues (30d)
4
Star History
401 stars in the last 14 days

Explore Similar Projects

Starred by Omar Sanseviero (DevRel at Google DeepMind), Li Jiang (Coauthor of AutoGen; Engineer at Microsoft), and 2 more.

ChatTTS by 2noise

Top 0.2%
38k stars
Generative speech model for daily dialogue
Created 1 year ago
Updated 2 months ago