OmniVoice by k2-fsa

State-of-the-art multilingual TTS for voice cloning and design

Created 1 week ago

2,917 stars

Top 16.1% on SourcePulse

View on GitHub
Project Summary

OmniVoice is a state-of-the-art, zero-shot multilingual text-to-speech (TTS) model designed for high-quality voice cloning and synthesis across more than 600 languages. It targets researchers, developers, and power users who need broad language support, advanced voice manipulation, and fast inference for applications ranging from content creation to accessibility tools. Its main advantages are wide language coverage and voice customization without requiring extensive training data for new voices.

How It Works

OmniVoice is built on a novel diffusion language model architecture, giving it a streamlined, scalable design that balances audio fidelity with inference speed. The model supports zero-shot voice cloning from short audio samples and voice design through controllable speaker attributes, providing a flexible and powerful TTS generation pipeline.

Quick Start & Requirements

  • Installation: Install the stable release with pip install omnivoice, or the latest source with pip install git+https://github.com/k2-fsa/OmniVoice.git; with uv, clone the repo and run uv sync. PyTorch must be installed to match your hardware: for CUDA 12.8, pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128; for Apple Silicon, pip install torch==2.8.0 torchaudio==2.8.0.
  • Prerequisites: NVIDIA GPU with compatible CUDA version (e.g., 12.8) or Apple Silicon. Python environment.
  • Performance: Achieves a Real-Time Factor (RTF) as low as 0.025, i.e. about 40x faster than real time.
  • Links: PyTorch Official Site, HuggingFace Space (for demo and pre-trained models).
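The installation options above can be collected into a small shell sketch that picks the PyTorch install command matching the local platform. The version pins mirror those listed above; the platform checks are an assumption (adjust the CUDA tag to your driver), and the command is only printed, not run:

```shell
# Select the PyTorch install command for OmniVoice based on platform.
# Apple Silicon gets the plain wheel; everything else assumes CUDA 12.8.
if [ "$(uname -s)" = "Darwin" ] && [ "$(uname -m)" = "arm64" ]; then
  TORCH_CMD="pip install torch==2.8.0 torchaudio==2.8.0"
else
  TORCH_CMD="pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128"
fi
echo "$TORCH_CMD"
```

After installing the matching PyTorch build, install OmniVoice itself with pip install omnivoice (or uv sync from a clone).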

Highlighted Details

  • Supports over 600 languages, offering the broadest language coverage among zero-shot TTS models.
  • Enables state-of-the-art zero-shot voice cloning and voice design with control over attributes like gender, age, pitch, and accent.
  • Features extremely fast inference speeds, with an RTF as low as 0.025.
  • Incorporates inline non-verbal symbols (e.g., [laughter]) and pronunciation control for enhanced expressiveness.
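As a quick sanity check on the speed claim above: RTF is synthesis time divided by audio duration, so the implied speedup over real time is simply its reciprocal. A minimal Python sketch:

```python
def speedup_from_rtf(rtf: float) -> float:
    """Return the real-time speedup implied by a Real-Time Factor.

    RTF = synthesis_time / audio_duration, so an RTF of 0.025 means
    1 second of audio is synthesized in 25 ms, a 1 / 0.025 = 40x speedup.
    """
    return 1.0 / rtf

print(speedup_from_rtf(0.025))  # → 40.0
```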

Maintenance & Community

Discussions are primarily handled via GitHub Issues. Community engagement also includes WeChat groups and an official account, accessible via QR codes in the README. No specific information on core maintainers, sponsorships, or partnerships is provided.

Licensing & Compatibility

The repository README does not state a software license. Without one, compatibility for commercial use, closed-source linking, or other deployment scenarios cannot be determined, and users would need clarification from the maintainers before relying on the project.

Limitations & Caveats

The most significant limitation is the lack of a specified open-source license, creating uncertainty regarding usage rights and commercial viability. Installation requires specific PyTorch and CUDA versions, and users may encounter issues downloading pre-trained models from HuggingFace without setting the HF_ENDPOINT environment variable. The project is presented as state-of-the-art, but specific benchmarks beyond RTF are not detailed.
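For the model-download issue mentioned above, HF_ENDPOINT is the huggingface_hub environment variable that redirects downloads to an alternate endpoint. A minimal sketch, assuming the community mirror hf-mirror.com (any reachable mirror URL works; which mirror to use is an assumption, not something the README is quoted as specifying):

```shell
# Point huggingface_hub at a mirror before downloading OmniVoice's
# pre-trained models; set this in the shell that launches inference.
export HF_ENDPOINT=https://hf-mirror.com
```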

Health Check

  • Last Commit: 22 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 13
  • Issues (30d): 72
  • Star History: 2,926 stars in the last 11 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Junyang Lin (core maintainer at Alibaba Qwen), and 6 more.

OpenVoice by myshell-ai

  • Audio foundation model for versatile, instant voice cloning
  • 36k stars (Top 0.1%)
  • Created 2 years ago, updated 11 months ago