VibeVoice by microsoft

Frontier Text-to-Speech for long conversations

Created 3 weeks ago

7,633 stars

Top 6.8% on SourcePulse

Project Summary

VibeVoice is a novel framework for generating expressive, long-form, multi-speaker conversational audio from text, targeting applications like podcast creation. It addresses limitations in traditional Text-to-Speech (TTS) systems by enabling longer audio generation, maintaining speaker consistency, and facilitating natural turn-taking in conversations with multiple speakers.

How It Works

VibeVoice utilizes continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate (7.5 Hz) for efficient, high-fidelity audio processing. It employs a next-token diffusion framework, integrating a Large Language Model (LLM) for contextual understanding and dialogue flow, and a diffusion head for generating detailed acoustic features. This approach allows for the synthesis of up to 90 minutes of audio with up to four distinct speakers.
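To give a sense of what the 7.5 Hz frame rate means for sequence length, the arithmetic can be sketched as follows. The function name and the 50 Hz comparison rate are illustrative assumptions, not values taken from the VibeVoice repository; only the 7.5 Hz and 90-minute figures come from the summary above.

```python
def frames_for_duration(minutes: float, frame_rate_hz: float) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)

# Figures from the summary: 7.5 Hz tokenizers, up to 90 minutes of audio.
vibevoice_frames = frames_for_duration(90, 7.5)

# Hypothetical comparison point: a tokenizer running at 50 frames per second.
comparison_frames = frames_for_duration(90, 50.0)

print(vibevoice_frames)   # 40500
print(comparison_frames)  # 270000
```

Under these assumptions, a full 90-minute session is roughly 40,500 frames per tokenizer stream, which is what makes it plausible to fit long conversations into an LLM context.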

Quick Start & Requirements

Highlighted Details

  • Synthesizes speech up to 90 minutes long with up to 4 distinct speakers.
  • Supports cross-lingual synthesis and spontaneous singing.
  • Utilizes ultra-low frame rate (7.5 Hz) continuous speech tokenizers for efficiency.
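The speaker and duration limits above can be checked against a script before attempting synthesis. The sketch below is not part of VibeVoice: it assumes a hypothetical transcript format in which each line starts with "Speaker N:", and a rough speaking rate of 150 words per minute. Whitespace word counting also means it only gives a sensible estimate for English text.

```python
import re

MAX_SPEAKERS = 4        # from the summary above
MAX_MINUTES = 90.0      # from the summary above
WORDS_PER_MINUTE = 150  # assumed average speaking rate, not a VibeVoice value

# Assumed line format: "Speaker 1: text". The real input format may differ.
LINE_RE = re.compile(r"^\s*(Speaker \d+)\s*:\s*(.*)$")

def check_script(script: str) -> dict:
    """Count distinct speakers and estimate duration from word count."""
    speakers = set()
    words = 0
    for line in script.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip blank or unlabelled lines
        speakers.add(match.group(1))
        words += len(match.group(2).split())
    minutes = words / WORDS_PER_MINUTE
    return {
        "speakers": len(speakers),
        "estimated_minutes": minutes,
        "within_limits": len(speakers) <= MAX_SPEAKERS and minutes <= MAX_MINUTES,
    }

example = """Speaker 1: Welcome back to the show.
Speaker 2: Thanks, glad to be here."""
report = check_script(example)
print(report["speakers"], report["within_limits"])  # 2 True
```

The duration figure is only an estimate; actual output length depends on the model's pacing and pauses.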

Maintenance & Community

Licensing & Compatibility

  • This summary found no explicit license statement in the repository. Users should verify the license terms on GitHub before using or redistributing the code or models.

Limitations & Caveats

VibeVoice is intended for research and development purposes only and is not recommended for commercial or real-world applications without further testing. It has a high potential for misuse in creating deepfakes and disinformation. The model currently only supports English and Chinese transcripts, does not handle non-speech audio (background noise, music), and does not generate overlapping speech. Users are responsible for ensuring transcript reliability, content accuracy, and lawful deployment. Disclosure of AI-generated content is best practice.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 37
  • Issues (30d): 65
  • Star history: 8,911 stars in the last 24 days
