VibeVoice by microsoft

Frontier Text-to-Speech for long conversations

Created 10 months ago

50,015 stars

Top 0.7% on SourcePulse

6 Experts Love This Project

Didier Lopes: Founder of OpenBB
Dan Guido: Cofounder of Trail of Bits
Luis Capelo: Cofounder of Lightning AI
Jason Huggins: Creator of Selenium
and 2 more

Project Summary

VibeVoice is a novel framework for generating expressive, long-form, multi-speaker conversational audio from text, targeting applications like podcast creation. It addresses limitations in traditional Text-to-Speech (TTS) systems by enabling longer audio generation, maintaining speaker consistency, and facilitating natural turn-taking in conversations with multiple speakers.

How It Works

VibeVoice utilizes continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate (7.5 Hz) for efficient, high-fidelity audio processing. It employs a next-token diffusion framework, integrating a Large Language Model (LLM) for contextual understanding and dialogue flow, and a diffusion head for generating detailed acoustic features. This approach allows for the synthesis of up to 90 minutes of audio with up to four distinct speakers.
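To make the frame-rate figure concrete, the short calculation below works out how many tokenizer frames 90 minutes of audio corresponds to at 7.5 Hz. It is only arithmetic on the numbers quoted above; the 50 Hz comparison rate is a hypothetical baseline chosen for illustration, not a figure from the project.

```python
FRAME_RATE_HZ = 7.5  # tokenizer frame rate quoted in the summary


def frames_for(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


# 90 minutes * 60 seconds * 7.5 frames per second
max_frames = frames_for(90)

# Hypothetical 50 Hz tokenizer, used only as a point of comparison.
baseline_frames = frames_for(90, frame_rate_hz=50.0)

print(max_frames)       # 40500
print(baseline_frames)  # 270000
```

At the quoted rate, the longest supported session is a sequence of 40,500 frames, which helps explain why a low frame rate matters for fitting long conversations into a language model's context.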

Quick Start & Requirements

Installation: The recommended route is an NVIDIA Deep Learning Container (e.g., nvcr.io/nvidia/pytorch:24.07-py3). If the container does not include flash attention, install it manually. Alternatively, clone the repository and run pip install -e . from the repository root.
Prerequisites: NVIDIA GPU with CUDA, Docker, Python. Flash attention is recommended.
Demo: A live Gradio demo is available at https://aka.ms/VibeVoice-Demo.
Models: Available on Hugging Face (https://huggingface.co/collections/microsoft/vibevoice-68a2ef24a875c44be47b034f).
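Before installing, it can help to confirm that the prerequisites listed above are present. The sketch below is not part of VibeVoice; it is a generic check that assumes PyTorch is importable as torch and flash attention as flash_attn (the module name of the flash-attn package). The function name check_environment is hypothetical.

```python
import importlib.util
import sys


def check_environment() -> dict:
    """Report which prerequisites appear to be present; never raises if one is missing."""
    report = {
        "python": sys.version.split()[0],
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "flash_attn_installed": importlib.util.find_spec("flash_attn") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        try:
            import torch

            # True only if PyTorch can see a usable CUDA device.
            report["cuda_available"] = bool(torch.cuda.is_available())
        except Exception:
            # A broken install should show up as "no CUDA", not as a crash.
            report["cuda_available"] = False
    return report


report = check_environment()
for name, value in report.items():
    print(f"{name}: {value}")
```

If flash_attn_installed comes back False inside the container, that is the case where the manual flash attention install mentioned above applies.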

Highlighted Details

Synthesizes speech up to 90 minutes long with up to 4 distinct speakers.
Supports cross-lingual synthesis and spontaneous singing.
Utilizes ultra-low frame rate (7.5 Hz) continuous speech tokenizers for efficiency.
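The four-speaker limit is easy to check before sending a script for synthesis. The sketch below assumes a transcript format in which each turn starts with a label such as "Speaker 1:"; that format, and the helper names, are assumptions for illustration, so check the repository's own examples for the format it actually expects.

```python
import re

MAX_SPEAKERS = 4  # limit quoted in the summary

# Assumed turn format: "Speaker 1: Hello there." (hypothetical)
SPEAKER_LINE = re.compile(r"^\s*(Speaker\s+\d+)\s*:\s*(.+)$")


def speakers_in(script: str) -> list:
    """Return the distinct speaker labels in order of first appearance."""
    seen = []
    for line in script.splitlines():
        match = SPEAKER_LINE.match(line)
        if match:
            # Normalize internal whitespace so "Speaker  1" equals "Speaker 1".
            label = " ".join(match.group(1).split())
            if label not in seen:
                seen.append(label)
    return seen


def check_speaker_limit(script: str, limit: int = MAX_SPEAKERS) -> bool:
    """True if the script uses no more than `limit` distinct speakers."""
    return len(speakers_in(script)) <= limit


script = """Speaker 1: Welcome back to the show.
Speaker 2: Thanks for having me.
Speaker 1: Let's get started."""

print(speakers_in(script))          # ['Speaker 1', 'Speaker 2']
print(check_speaker_limit(script))  # True
```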

Maintenance & Community

Project Page: https://microsoft.github.io/VibeVoice
Technical Report: report/TechnicalReport.pdf (path relative to the repository root)

Licensing & Compatibility

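No explicit license statement was found for the repository. A permissive license is plausible for a Microsoft project hosted on GitHub, but that is an assumption, not a confirmed fact. Check the repository's license file and the Hugging Face model cards before using or redistributing the code or models.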

Limitations & Caveats

VibeVoice is intended for research and development purposes only and is not recommended for commercial or real-world applications without further testing. It has a high potential for misuse in creating deepfakes and disinformation. The model currently only supports English and Chinese transcripts, does not handle non-speech audio (background noise, music), and does not generate overlapping speech. Users are responsible for ensuring transcript reliability, content accuracy, and lawful deployment. Disclosure of AI-generated content is best practice.

Health Check

Last Commit: 1 week ago
Responsiveness: Inactive

Star History

895 stars in the last 30 days