voicebox by jamiepine

Local voice synthesis studio for private, professional audio production

Created 5 months ago

38,451 stars

Top 1.0% on SourcePulse

1 Expert Loves This Project

Dan Guido

Cofounder of Trail of Bits

Project Summary

Voicebox is an open-source, local-first voice synthesis studio designed for cloning voices, generating speech, and building voice-powered applications. It offers a privacy-focused, professional-grade alternative to cloud-based services, allowing users to manage voice data and models entirely on their machine. The target audience includes developers, researchers, and content creators seeking granular control over voice synthesis without cloud dependencies.

How It Works

Voicebox pairs a Tauri (Rust) desktop application, chosen for performance and low memory use, with a FastAPI (Python) backend. It uses Qwen3-TTS for high-fidelity voice cloning from minimal audio samples. A key differentiator is its inference engine: on Apple Silicon, MLX with native Metal acceleration provides 4-5x faster generation, while Windows, Linux, and Intel Macs use PyTorch, which benefits from a CUDA GPU where one is available. All processing runs locally, which keeps voice data private.
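The platform split described above can be sketched as a small selection function. This is an illustrative sketch only, not Voicebox's actual code: the function name pick_inference_backend and the returned labels are hypothetical, and it assumes a native (non-Rosetta) Python, where platform.machine() reports arm64 on Apple Silicon.

```python
import platform


def pick_inference_backend(system=None, machine=None):
    """Return "mlx" on Apple Silicon and "pytorch" everywhere else.

    Hypothetical helper: the name and return labels are assumptions
    made for illustration, not part of the Voicebox codebase.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    # macOS reports "Darwin"; Apple Silicon reports "arm64" under native Python.
    if system == "Darwin" and machine == "arm64":
        return "mlx"
    # Windows, Linux, and Intel Macs fall back to PyTorch.
    return "pytorch"


print(pick_inference_backend())
```

Passing system and machine explicitly makes the decision easy to exercise without owning each kind of hardware.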

Quick Start & Requirements

Primary install/run command: On macOS or Linux, run make setup followed by make dev. Manual setup involves bun install, cd backend && pip install -r requirements.txt, and bun run dev.
Prerequisites: Bun, Rust, Python 3.11+. A CUDA GPU is recommended for Windows/Linux/Intel Mac.
Links: Official website: voicebox.sh. Contribution guidelines: CONTRIBUTING.md. Security details: SECURITY.md. API documentation is available at http://localhost:8000/docs when the server is running.
Estimated setup time or resource footprint: Not explicitly detailed, but the Tauri-based app is noted for being significantly smaller and more memory-efficient than Electron alternatives.
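Before running the setup commands, it can help to confirm the listed prerequisites are present. The following is a minimal sketch under stated assumptions: it treats the cargo executable as a proxy for a Rust installation, and the helper name missing_prerequisites is hypothetical rather than anything shipped with the project.

```python
import shutil
import sys

# Assumption: "cargo" on PATH is used as a stand-in for "Rust is installed".
REQUIRED_TOOLS = ("bun", "cargo")
MIN_PYTHON = (3, 11)


def missing_prerequisites(which=shutil.which, version=None):
    """Return a list of prerequisites that appear to be missing.

    `which` and `version` are injectable so the check can be tested
    without changing the local machine.
    """
    version = tuple(version or sys.version_info[:2])
    missing = [tool for tool in REQUIRED_TOOLS if which(tool) is None]
    if version < MIN_PYTHON:
        missing.append("python>=3.11")
    return missing


print(missing_prerequisites() or "all prerequisites found")
```

An empty list means Bun, the Rust toolchain, and a new enough Python were all found.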

Highlighted Details

Near-perfect voice cloning from just a few seconds of audio using Qwen3-TTS.
Local-first, privacy-centric design, keeping voice data on the user's machine.
Professional audio editing features, including a multi-track timeline, trimming, and conversation mixing.
Significantly accelerated inference speeds on Apple Silicon (M1/M2/M3) via MLX and Metal.
API-first approach enables seamless integration into custom applications.
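To illustrate the API-first point, the sketch below lists the routes a running local server exposes by reading its OpenAPI schema. It assumes the backend keeps FastAPI's default schema location (/openapi.json next to /docs) on http://localhost:8000; the project may configure this differently, and no specific Voicebox endpoint names are assumed here. The helper names are hypothetical.

```python
import json
from urllib.request import urlopen

# Base URL taken from the documented docs address; adjust if the server differs.
BASE_URL = "http://localhost:8000"
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def openapi_url(base_url=BASE_URL):
    """Build the schema URL, assuming FastAPI's default /openapi.json path."""
    return base_url.rstrip("/") + "/openapi.json"


def list_routes(schema):
    """Return sorted (METHOD, path) pairs from an OpenAPI schema dict."""
    routes = []
    for path, operations in schema.get("paths", {}).items():
        for method in operations:
            # Path items can also hold non-method keys such as "parameters".
            if method.lower() in HTTP_METHODS:
                routes.append((method.upper(), path))
    return sorted(routes)


def fetch_routes(base_url=BASE_URL, timeout=5):
    """Download the schema from a running server and list its routes."""
    with urlopen(openapi_url(base_url), timeout=timeout) as response:
        return list_routes(json.load(response))
```

With the server running, calling fetch_routes() returns the available method and path pairs, which is a quick way to discover what an integration can call before reading the interactive docs.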

Maintenance & Community

The project includes CONTRIBUTING.md and SECURITY.md files, indicating structured processes for development and security. A roadmap is provided, suggesting ongoing development and future feature planning. No specific community channels like Discord or Slack are mentioned in the README.

Licensing & Compatibility

The project is released under the MIT License, which permits commercial use and integration into closed-source projects, provided the copyright and license notice are retained.

Limitations & Caveats

Linux builds are currently unavailable due to GitHub runner disk space limitations. Support for additional voice models such as XTTS and Bark is planned for future releases, along with advanced features like real-time synthesis and a word-level precision timeline editor.

Health Check

Last Commit

3 days ago

Responsiveness

Inactive

Star History

9,042 stars in the last 30 days