OmniVoice by k2-fsa

State-of-the-art multilingual TTS for voice cloning and design

Created 1 week ago

2,917 stars

Top 16.1% on SourcePulse

View on GitHub
Project Summary

OmniVoice is a state-of-the-art, zero-shot multilingual text-to-speech (TTS) model designed for high-quality voice cloning and synthesis across more than 600 languages. It targets researchers, developers, and power users who need broad language support, advanced voice manipulation, and fast inference for applications ranging from content creation to accessibility tools. Its main advantages are wide language coverage and voice customization without requiring extensive training data for new voices.

How It Works

OmniVoice is built on a novel diffusion language model architecture, giving it a streamlined, scalable design that balances audio fidelity with inference speed. The model supports zero-shot voice cloning from short audio samples and voice design through controllable speaker attributes, providing a flexible and powerful TTS generation pipeline.

Quick Start & Requirements

  • Installation: Install the stable release with pip install omnivoice, or the latest source with pip install git+https://github.com/k2-fsa/OmniVoice.git; with uv, clone the repo and run uv sync. PyTorch must be installed to match your hardware: for CUDA 12.8, pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128; for Apple Silicon, pip install torch==2.8.0 torchaudio==2.8.0.
  • Prerequisites: NVIDIA GPU with compatible CUDA version (e.g., 12.8) or Apple Silicon. Python environment.
  • Performance: Achieves a Real-Time Factor (RTF) as low as 0.025, i.e. about 40x faster than real time.
  • Links: PyTorch Official Site, HuggingFace Space (for demo and pre-trained models).
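The installation options above can be collected into a small shell sketch that picks the PyTorch install command matching the local platform. The version pins mirror those listed above; the platform checks are an assumption (adjust the CUDA tag to your driver), and the command is only printed, not run:

```shell
# Select the PyTorch install command for OmniVoice based on platform.
# Apple Silicon gets the plain wheel; everything else assumes CUDA 12.8.
if [ "$(uname -s)" = "Darwin" ] && [ "$(uname -m)" = "arm64" ]; then
  TORCH_CMD="pip install torch==2.8.0 torchaudio==2.8.0"
else
  TORCH_CMD="pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128"
fi
echo "$TORCH_CMD"
```

After installing the matching PyTorch build, install OmniVoice itself with pip install omnivoice (or uv sync from a clone).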

Highlighted Details

  • Supports over 600 languages, offering the broadest language coverage among zero-shot TTS models.
  • Enables state-of-the-art zero-shot voice cloning and voice design with control over attributes like gender, age, pitch, and accent.
  • Features extremely fast inference speeds, with an RTF as low as 0.025.
  • Incorporates inline non-verbal symbols (e.g., [laughter]) and pronunciation control for enhanced expressiveness.
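As a quick sanity check on the speed claim above: RTF is synthesis time divided by audio duration, so the implied speedup over real time is simply its reciprocal. A minimal Python sketch:

```python
def speedup_from_rtf(rtf: float) -> float:
    """Return the real-time speedup implied by a Real-Time Factor.

    RTF = synthesis_time / audio_duration, so an RTF of 0.025 means
    1 second of audio is synthesized in 25 ms, a 1 / 0.025 = 40x speedup.
    """
    return 1.0 / rtf

print(speedup_from_rtf(0.025))  # → 40.0
```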

Maintenance & Community

Discussions are primarily handled via GitHub Issues. Community engagement also includes WeChat groups and an official account, accessible via QR codes in the README. No specific information on core maintainers, sponsorships, or partnerships is provided.

Licensing & Compatibility

The repository README does not state a software license. Without one, compatibility for commercial use, closed-source linking, or other deployment scenarios cannot be determined, and users would need clarification from the maintainers before relying on the project.

Limitations & Caveats

The most significant limitation is the lack of a specified open-source license, creating uncertainty regarding usage rights and commercial viability. Installation requires specific PyTorch and CUDA versions, and users may encounter issues downloading pre-trained models from HuggingFace without setting the HF_ENDPOINT environment variable. The project is presented as state-of-the-art, but specific benchmarks beyond RTF are not detailed.
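For the model-download issue mentioned above, HF_ENDPOINT is the huggingface_hub environment variable that redirects downloads to an alternate endpoint. A minimal sketch, assuming the community mirror hf-mirror.com (any reachable mirror URL works; which mirror to use is an assumption, not something the README is quoted as specifying):

```shell
# Point huggingface_hub at a mirror before downloading OmniVoice's
# pre-trained models; set this in the shell that launches inference.
export HF_ENDPOINT=https://hf-mirror.com
```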

Health Check

  • Last Commit: 22 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 13
  • Issues (30d): 72
  • Star History: 2,926 stars in the last 11 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Junyang Lin (core maintainer at Alibaba Qwen), and 6 more.

OpenVoice by myshell-ai

  • Audio foundation model for versatile, instant voice cloning
  • 36k stars (Top 0.1%)
  • Created 2 years ago, updated 11 months ago