kugelaudio-open by Kugelaudio

Advanced European language text-to-speech synthesis

Created 5 months ago

266 stars

Top 95.9% on SourcePulse

Project Summary

KugelAudio addresses the shortage of open-source text-to-speech (TTS) models for European languages. Trained on extensive European-language data, it is reported to outperform commercial systems such as ElevenLabs in human preference tests. It gives researchers and developers high-fidelity, expressive speech synthesis for 24 European languages, with a selection of pre-encoded voices and a broad emotional range.

How It Works

The system uses a hybrid autoregressive (AR) + diffusion architecture built on Microsoft's VibeVoice. A Qwen2-based encoder processes the input text and feeds a transformer TTS backbone; a diffusion head then generates speech latents, which are decoded into audio waveforms. The model was trained on roughly 200,000 hours of YODAS2 data, which supports nuanced prosody and style control as well as broad European language coverage.
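The data flow described above can be illustrated with a toy sketch. Everything in it is hypothetical: the function names, the sizes, and the arithmetic are stand-ins chosen only to show the order of the stages (text encoding, autoregressive backbone, iterative refinement from noise, waveform decoding). It is not KugelAudio's or VibeVoice's code, and the refinement loop is a simple placeholder for a real diffusion sampler.

```python
"""Toy data-flow sketch of an AR + diffusion TTS pipeline (all names hypothetical)."""
import math
import random

LATENT_DIM = 4          # assumed toy latent size
SAMPLES_PER_FRAME = 8   # assumed toy number of audio samples per latent frame


def encode_text(text):
    # Stand-in for the text encoder: one integer "token" per character.
    return [ord(ch) % 256 for ch in text]


def ar_backbone(token_ids):
    # Stand-in for the autoregressive backbone: each state depends on the
    # previous state and the current token.
    states, prev = [], 0.0
    for tok in token_ids:
        prev = math.tanh(0.5 * prev + tok / 256.0)
        states.append(prev)
    return states


def diffusion_head(condition, steps=10, seed=0):
    # Stand-in for a diffusion sampler: start from Gaussian noise and move
    # halfway toward the conditioning value at every step.
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
    for _ in range(steps):
        latent = [x + 0.5 * (condition - x) for x in latent]
    return latent


def decode_latents(latents):
    # Stand-in for the audio decoder: expand each latent frame into samples
    # bounded to [-1, 1].
    waveform = []
    for latent in latents:
        for i in range(SAMPLES_PER_FRAME):
            waveform.append(math.tanh(latent[i % LATENT_DIM]))
    return waveform


def synthesize(text):
    tokens = encode_text(text)
    states = ar_backbone(tokens)
    latents = [diffusion_head(s, seed=i) for i, s in enumerate(states)]
    return decode_latents(latents)


audio = synthesize("Hallo")
print(len(audio), "toy samples")
```

In the real system each stage is a large neural network; the sketch only mirrors how one stage's output becomes the next stage's input.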

Quick Start & Requirements

Install with uv (recommended) or pip; Python 3.10+ and PyTorch 2.0+ are required, and a CUDA-capable GPU is recommended for acceleration. Inference with the 7B model needs approximately 19GB of VRAM. Training is far more demanding: 8x NVIDIA H100 GPUs for 5 days. Running uv run python start.py handles setup and starts inference, and comprehensive documentation is linked from the README.
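As a rough pre-flight check, the stated requirements can be verified from Python before installing. This is a minimal sketch, not part of the project: the helper names are hypothetical, and the memory estimate assumes 16-bit weights, which the source does not state.

```python
import itertools
import sys


def major_minor(version):
    """Turn a version string such as '2.1.0+cu121' into (2, 1)."""
    nums = []
    for piece in version.split("+")[0].split(".")[:2]:
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        nums.append(int(digits) if digits else 0)
    while len(nums) < 2:
        nums.append(0)
    return tuple(nums)


def meets_minimum(version, minimum):
    """True if the version string is at least the (major, minor) minimum."""
    return major_minor(version) >= minimum


def check_environment():
    """Report whether Python 3.10+, PyTorch 2.0+, and CUDA are present."""
    report = {"python_ok": sys.version_info[:2] >= (3, 10)}
    try:
        import torch
    except ImportError:
        report["torch_ok"] = False
        report["cuda_available"] = False
    else:
        report["torch_ok"] = meets_minimum(torch.__version__, (2, 0))
        report["cuda_available"] = torch.cuda.is_available()
    return report


# Back-of-envelope figures derived from the numbers quoted above.
gpu_hours = 8 * 5 * 24                    # 8 GPUs for 5 days
weights_gb = 7_000_000_000 * 2 / 1e9      # 7B parameters, assuming 2 bytes each

print(check_environment())
print(f"training budget: {gpu_hours} GPU-hours")
print(f"weights alone (16-bit assumption): {weights_gb:.0f} GB")
```

Under the 16-bit assumption the weights account for about 14GB, so the remainder of the quoted 19GB would go to activations, caches, and the audio decoder; actual usage depends on the precision and settings used.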

Highlighted Details

State-of-the-Art Performance: Ranks highest in human preference evaluations (OpenSkill), surpassing ElevenLabs.
European Language Focus: Specifically trained for 24 major European languages, with strong representation for German, French, Spanish, and English.
Expressive Speech Synthesis: Supports diverse speaking styles including whispering, shouting, singing, and nuanced emotional tones.
Audio Watermarking: Integrates Facebook's AudioSeal for imperceptible AI-generated audio detection.
Pre-encoded Voices: Offers selection from high-quality, pre-encoded speaker voices.

Maintenance & Community

Led by Kajo Kratzenstein and Carlos Menke, the project is funded by the German Federal Ministry of Research. No dedicated community channels are listed; models are hosted on Hugging Face.

Licensing & Compatibility

Released under the permissive MIT License, KugelAudio allows commercial use and integration into closed-source applications, provided the copyright and license notice is retained.

Limitations & Caveats

Speech quality varies by language; languages with less representation in the training data may perform worse. Voice cloning from raw audio is not supported; only the predefined voices are available. Training requires significant computational resources.

Explore Similar Projects

VoiceSculptor by ASLP-lab

qwen-tts-webui by licyk

ComfyUI-Qwen3-TTS by DarioFT

Habibi-TTS by SWivid

FireRedTTS by FireRedTeam

speech-synthesis-paper by wenet-e2e

ComfyUI-Qwen-TTS by flybirdxx

WhisperSpeech by WhisperSpeech

Zonos by Zyphra

KittenTTS by KittenML

OmniVoice by k2-fsa

GPT-SoVITS by RVC-Boss