VibeVoice by Microsoft

Frontier Text-to-Speech for long conversations

Created 2 months ago
7,633 stars

Top 6.8% on SourcePulse

Project Summary

VibeVoice is a novel framework for generating expressive, long-form, multi-speaker conversational audio from text, targeting applications like podcast creation. It addresses limitations in traditional Text-to-Speech (TTS) systems by enabling longer audio generation, maintaining speaker consistency, and facilitating natural turn-taking in conversations with multiple speakers.

How It Works

VibeVoice utilizes continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate (7.5 Hz) for efficient, high-fidelity audio processing. It employs a next-token diffusion framework, integrating a Large Language Model (LLM) for contextual understanding and dialogue flow, and a diffusion head for generating detailed acoustic features. This approach allows for the synthesis of up to 90 minutes of audio with up to four distinct speakers.
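To make the efficiency claim concrete, the arithmetic behind the 7.5 Hz frame rate can be sketched as follows. This is a back-of-the-envelope illustration, not code from the VibeVoice repository: the function name `frames_for` and the 50 Hz comparison rate are assumptions chosen only for contrast.

```python
# Rough sequence-length arithmetic for a tokenizer running at a fixed frame rate.
# FRAME_RATE_HZ and MAX_MINUTES come from the summary above; the comparison
# rate below is a hypothetical value, not a figure from the project.

FRAME_RATE_HZ = 7.5   # frames per second, as stated for VibeVoice
MAX_MINUTES = 90      # maximum stated audio length


def frames_for(duration_s: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Return the number of whole frames needed to cover duration_s seconds."""
    return int(duration_s * frame_rate_hz)


max_duration_s = MAX_MINUTES * 60                 # 5400 seconds
vibevoice_frames = frames_for(max_duration_s)     # 5400 * 7.5
hypothetical_frames = frames_for(max_duration_s, frame_rate_hz=50.0)  # assumed rate

print(vibevoice_frames)      # 40500
print(hypothetical_frames)   # 270000
```

Under these assumptions, a full 90-minute session corresponds to roughly 40,500 frames, which suggests why a low frame rate matters when an LLM has to model the whole conversation as one sequence.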

Quick Start & Requirements

Highlighted Details

  • Synthesizes speech up to 90 minutes long with up to 4 distinct speakers.
  • Supports cross-lingual synthesis and spontaneous singing.
  • Utilizes ultra-low frame rate (7.5 Hz) continuous speech tokenizers for efficiency.
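The four-speaker limit above can be checked before a script is sent for synthesis. The sketch below is a minimal pre-flight check, assuming a hypothetical transcript layout in which every non-empty line starts with a `Speaker N:` label; that layout, the `check_transcript` function, and the `SPEAKER_LINE` pattern are illustrative assumptions rather than the project's documented input format.

```python
import re

MAX_SPEAKERS = 4  # limit stated in the summary above

# Assumed line layout: "Speaker <number>: <text>". This is a hypothetical
# convention for illustration, not a format confirmed by the source.
SPEAKER_LINE = re.compile(r"^(Speaker \d+):\s*(.+)$")


def check_transcript(text: str) -> list[str]:
    """Return the distinct speaker labels in order of first appearance.

    Raises ValueError if a line lacks a speaker label or if the transcript
    uses more than MAX_SPEAKERS distinct speakers.
    """
    speakers: list[str] = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored
        match = SPEAKER_LINE.match(line)
        if match is None:
            raise ValueError(f"line {number} has no speaker label: {line!r}")
        label = match.group(1)
        if label not in speakers:
            speakers.append(label)
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(
            f"{len(speakers)} speakers found, but at most {MAX_SPEAKERS} are supported"
        )
    return speakers


script = "Speaker 1: Welcome to the show.\nSpeaker 2: Glad to be here."
print(check_transcript(script))
```

A check like this only guards the speaker count; it does not estimate audio length, so the 90-minute ceiling would need a separate check.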

Maintenance & Community

Licensing & Compatibility

  • The repository does not explicitly state a license. Users should verify the licensing terms before use rather than assume they are permissive.

Limitations & Caveats

VibeVoice is intended for research and development purposes only and is not recommended for commercial or real-world applications without further testing. It has a high potential for misuse in creating deepfakes and disinformation. The model currently only supports English and Chinese transcripts, does not handle non-speech audio (background noise, music), and does not generate overlapping speech. Users are responsible for ensuring transcript reliability, content accuracy, and lawful deployment. Disclosure of AI-generated content is best practice.

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 3

Star History

305 stars in the last 30 days
