KugelAudio: Advanced European language text-to-speech synthesis
Summary
KugelAudio addresses the shortage of open-source text-to-speech (TTS) models for European languages. According to the project, it outperforms commercial systems such as ElevenLabs in human preference tests, which it attributes to training on extensive European-language data. It offers researchers and developers high-fidelity, expressive speech synthesis for 24 European languages, with a selection of pre-encoded voices and a range of emotional expression.
How It Works
The system uses a hybrid autoregressive (AR) + diffusion architecture built on Microsoft's VibeVoice. A Qwen2-based encoder processes the text and feeds a transformer TTS backbone; a diffusion head then generates speech latents, which are decoded into audio waveforms. The model was trained on ~200,000 hours of YODAS2 data. The project credits this approach with nuanced prosody and style control and with its European-language coverage and quality.
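The order of the stages described above can be illustrated with a toy control-flow sketch. This is not KugelAudio's or VibeVoice's code: every function name, the latent size, and the arithmetic inside each stage are hypothetical stand-ins chosen only to show how an autoregressive loop can call a diffusion-style head once per frame before a final decode.

```python
import random
from typing import List

LATENT_DIM = 4  # toy size (assumption), not the real model's latent width


def encode_text(text: str) -> List[float]:
    """Stand-in for the text encoder: map text to a fixed-size conditioning vector."""
    rng = random.Random(text)  # string seed makes the output deterministic per text
    return [rng.uniform(-1.0, 1.0) for _ in range(LATENT_DIM)]


def backbone_step(cond: List[float], history: List[List[float]]) -> List[float]:
    """Stand-in for the AR backbone: condition on the text and the previous latent."""
    if not history:
        return list(cond)
    prev = history[-1]
    return [0.5 * c + 0.5 * p for c, p in zip(cond, prev)]


def diffusion_head(hidden: List[float], steps: int, rng: random.Random) -> List[float]:
    """Stand-in for the diffusion head: start from noise, refine toward the condition."""
    latent = [rng.gauss(0.0, 1.0) for _ in hidden]
    for _ in range(steps):
        # Each step halves the distance between the noisy latent and its target.
        latent = [x + 0.5 * (h - x) for x, h in zip(latent, hidden)]
    return latent


def decode_latents(latents: List[List[float]]) -> List[float]:
    """Stand-in for the decoder: flatten the latent frames into a 1-D 'waveform'."""
    return [x for frame in latents for x in frame]


def synthesize(text: str, num_frames: int = 3, steps: int = 8, seed: int = 0) -> List[float]:
    rng = random.Random(seed)
    cond = encode_text(text)
    latents: List[List[float]] = []
    for _ in range(num_frames):          # autoregressive loop over speech frames
        hidden = backbone_step(cond, latents)
        latents.append(diffusion_head(hidden, steps, rng))
    return decode_latents(latents)


if __name__ == "__main__":
    print(len(synthesize("Guten Morgen")))
```

The point of the sketch is the structure: each new frame depends on the frames already generated (the AR part), while the content of each frame is produced by iterative refinement from noise (the diffusion part).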
Quick Start & Requirements
Install with uv (recommended) or pip; Python 3.10+ and PyTorch 2.0+ are required, and CUDA is recommended for GPU acceleration. Inference with the 7B model needs approximately 19 GB of VRAM. Training is far more demanding: the project reports 8x NVIDIA H100 GPUs for 5 days. Running uv run python start.py handles setup and inference, and the README links to the full documentation.
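Before installing, it can help to check a machine against the stated requirements. The script below is a minimal sketch and not part of the project: the helper names are hypothetical, the thresholds are copied from the figures above, and the weights estimate assumes 16-bit parameters (2 bytes each), which the source does not state.

```python
import importlib.util
import re
import sys

MIN_PYTHON = (3, 10)
MIN_TORCH = (2, 0)
INFERENCE_VRAM_GB = 19  # approximate figure quoted for the 7B model


def parse_version(version: str) -> tuple:
    """Extract (major, minor) from a version string such as '2.1.0+cu121'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def weights_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in GB (assumes 16-bit parameters by default)."""
    return num_params * bytes_per_param / 1e9


def check_environment() -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python 3.10+ is required")
    if importlib.util.find_spec("torch") is None:
        problems.append("PyTorch is not installed")
        return problems
    import torch  # imported lazily so the script still runs without PyTorch

    if parse_version(torch.__version__) < MIN_TORCH:
        problems.append("PyTorch 2.0+ is required")
    if not torch.cuda.is_available():
        problems.append("CUDA is not available; inference would fall back to CPU")
    else:
        total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
        if total_gb < INFERENCE_VRAM_GB:
            problems.append(f"GPU 0 has {total_gb:.1f} GB; about 19 GB is needed")
    return problems


if __name__ == "__main__":
    print(f"7B weights at 16 bits: {weights_gb(7e9):.0f} GB")
    for problem in check_environment():
        print("WARNING:", problem)
```

Under the 16-bit assumption, the weights of a 7B model account for about 14 GB, so the quoted figure of roughly 19 GB would leave around 5 GB for activations and other runtime overhead.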
Maintenance & Community
Led by Kajo Kratzenstein and Carlos Menke, the project is funded by the German Federal Ministry of Research. No community channels are listed; the models are hosted on Hugging Face.
Licensing & Compatibility
Released under the permissive MIT License, KugelAudio can be used commercially and integrated into closed-source applications, provided the copyright and license notice are retained.
Limitations & Caveats
Speech quality varies by language, and languages with less training data may perform worse. Voice cloning from raw audio is not supported; only predefined voices are available. Training requires significant computational resources.