Voice-Clone-Studio by FranckyB

Gradio web UI for advanced voice cloning and design

Created 2 weeks ago · 284 stars · Top 92.4% on SourcePulse

View on GitHub: https://github.com/FranckyB/Voice-Clone-Studio

Project Summary

This project provides a Gradio-based web UI for advanced voice cloning and voice design, leveraging Qwen3-TTS and VibeVoice for speech synthesis and Whisper or VibeVoice-ASR for transcription. It targets engineers and researchers who need a flexible tool for generating custom speech, building multi-speaker dialogues, and designing voices from natural language descriptions. The result is a powerful yet accessible platform for synthetic media creation.

How It Works

Voice Clone Studio integrates several models behind a single Gradio interface. Qwen3-TTS generates speech from text, offering both fast preset voices and voice design driven by descriptive prompts. For higher-quality and longer-form generation, VibeVoice adds custom voice cloning from user-provided samples and up to 90 minutes of continuous speech. Automatic speech recognition is handled by either OpenAI's Whisper or VibeVoice-ASR, so reference audio can be transcribed without leaving the UI. Voice prompt caching speeds up repeated generations, and seed control makes outputs reproducible.
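
As an illustration of the wiring described above (not code from the repository), here is a minimal Gradio sketch: the `synthesize` function is a hypothetical placeholder that returns a tone instead of calling Qwen3-TTS or VibeVoice, and only the text/seed-to-audio plumbing reflects how such a UI is typically assembled.

```python
import numpy as np
import gradio as gr

def synthesize(text: str, seed: int):
    # Hypothetical stand-in for the real TTS call (Qwen3-TTS / VibeVoice in the actual app).
    # `text` is ignored here; a fixed seed yields a reproducible "output", mirroring seed control.
    rng = np.random.default_rng(int(seed))
    sr = 24_000
    t = np.linspace(0, 1.0, sr, endpoint=False)
    freq = 200 + rng.integers(0, 200)            # placeholder "inference"
    wave = 0.2 * np.sin(2 * np.pi * freq * t).astype(np.float32)
    return sr, wave                              # gr.Audio accepts a (sample_rate, ndarray) tuple

with gr.Blocks(title="Voice Clone Studio (sketch)") as demo:
    text = gr.Textbox(label="Text to speak")
    seed = gr.Number(label="Seed", value=42, precision=0)
    audio = gr.Audio(label="Generated speech")
    gr.Button("Generate").click(synthesize, inputs=[text, seed], outputs=audio)

if __name__ == "__main__":
    demo.launch()
```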

Quick Start & Requirements

Installation involves cloning the repository (git clone https://github.com/FranckyB/Voice-Clone-Studio.git) and running a setup script (setup.bat on Windows) or manual setup. Prerequisites include Python 3.12+, a CUDA-compatible GPU (8GB+ VRAM recommended), SOX, and FFMPEG. The UI is launched via python voice_clone_studio.py or launch.bat. Optional Flash Attention 2 can improve performance.
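
On Windows, those steps condense to roughly the following (the `cd` step is implied rather than stated; on other platforms, follow the manual setup instead of setup.bat):

```
git clone https://github.com/FranckyB/Voice-Clone-Studio.git
cd Voice-Clone-Studio
setup.bat
python voice_clone_studio.py
```

Alternatively, launch.bat starts the UI once setup has completed.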

Highlighted Details

  • Voice Cloning: Create custom voices from short audio samples (3-10 seconds) paired with their exact transcripts.
  • Voice Design: Generate unique voices using Qwen3-TTS by describing desired characteristics like age, gender, accent, and emotion.
  • Multi-Speaker Conversations: Construct dialogues using Qwen's 9 preset voices or up to 4 custom VibeVoice samples, with VibeVoice supporting up to 90 minutes of continuous speech.
  • Dual TTS Engines: Choose between Qwen (fast, preset voices) and VibeVoice (high-quality, custom, long-form) for speech synthesis.
  • Flexible Transcription: Supports both Whisper and VibeVoice-ASR for automatic audio transcription.
  • Audio Preparation: Integrated tools for trimming, normalizing, and converting audio to mono.
  • Performance Features: Voice prompt caching speeds up repeated generations, and seed control ensures reproducible results (a generic sketch of the caching pattern follows this list).
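
The caching idea behind that last point is conventional; the sketch below is illustrative only (none of these names come from the project) and assumes the voice prompt is keyed by a content hash of the reference clip plus its transcript.

```python
import hashlib
from functools import lru_cache
from pathlib import Path

def _audio_key(path: str) -> str:
    # Key the reference clip by its content hash so an edited file misses the cache.
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

@lru_cache(maxsize=32)
def _voice_prompt(audio_key: str, transcript: str) -> dict:
    # Hypothetical stand-in for the expensive step that turns a reference clip
    # plus its exact transcript into a reusable voice prompt.
    return {"audio_sha256": audio_key, "transcript": transcript}

def get_voice_prompt(audio_path: str, transcript: str) -> dict:
    # Repeated generations with the same sample reuse the cached prompt.
    return _voice_prompt(_audio_key(audio_path), transcript)
```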

Maintenance & Community

The provided README does not contain specific details regarding notable contributors, sponsorships, community channels (like Discord or Slack), or a public roadmap.

Licensing & Compatibility

The project is licensed under the Apache License 2.0. Its core components also use the Apache 2.0 license (Qwen3-TTS, Gradio) and the MIT license (VibeVoice, Whisper). These permissive licenses generally allow for commercial use and integration into closed-source projects.

Limitations & Caveats

The VibeVoice engine may spontaneously add background music or sounds for realism, and it does not support style instructions. A CUDA-compatible GPU with sufficient VRAM is essential for optimal performance, particularly with larger models. External dependencies like SOX and FFMPEG must be installed separately.

Health Check

  • Last Commit: 23 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 21
  • Issues (30d): 15
  • Star History: 286 stars in the last 17 days
