kugelaudio-open  by Kugelaudio

Advanced European language text-to-speech synthesis

Created 4 months ago
256 stars

Top 98.5% on SourcePulse

Summary

KugelAudio addresses the shortage of open-source text-to-speech (TTS) models for European languages. The project reports state-of-the-art quality, outperforming commercial leaders such as ElevenLabs in human preference tests, and attributes this to training on extensive European-language data. It offers researchers and developers high-fidelity, expressive speech synthesis for 24 European languages, with a selection of pre-encoded voices and a range of emotional styles.

How It Works

The system uses a hybrid autoregressive (AR) + diffusion architecture built on Microsoft's VibeVoice. A Qwen2-based encoder processes the text and feeds a transformer TTS backbone. A diffusion head then generates speech latents, which are decoded into audio waveforms. The model was trained on ~200,000 hours of YODAS2 data; the project credits this combination with nuanced prosody and style control and with its coverage and quality across European languages.
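The order of the stages described above can be illustrated with a toy data-flow sketch. This is not KugelAudio or VibeVoice code: every function name, constant, and numeric rule below is a hypothetical stand-in, chosen only to show how text states, an autoregressive loop, a diffusion-style refinement step, and a decoder could fit together.

```python
import random
from typing import List

LATENT_DIM = 4           # hypothetical latent size, for illustration only
SAMPLES_PER_LATENT = 8   # hypothetical number of audio samples per latent frame


def encode_text(text: str) -> List[float]:
    """Stand-in for the text encoder: one number per character."""
    return [ord(ch) / 255.0 for ch in text]


def backbone_step(text_states: List[float], previous: List[List[float]]) -> float:
    """Stand-in for the transformer backbone: condense the text states and
    all previously generated latents into a single conditioning value."""
    history = sum(sum(latent) for latent in previous)
    return (sum(text_states) + history) / (len(text_states) + 1)


def diffusion_head(condition: float, rng: random.Random, steps: int = 5) -> List[float]:
    """Stand-in for the diffusion head: start from Gaussian noise and move
    the latent halfway toward the conditioning value at each step."""
    latent = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
    for _ in range(steps):
        latent = [x + 0.5 * (condition - x) for x in latent]
    return latent


def decode_latents(latents: List[List[float]]) -> List[float]:
    """Stand-in for the audio decoder: expand each latent frame into a
    short run of identical samples."""
    waveform: List[float] = []
    for latent in latents:
        mean = sum(latent) / len(latent)
        waveform.extend([mean] * SAMPLES_PER_LATENT)
    return waveform


def synthesize(text: str, num_frames: int = 3, seed: int = 0) -> List[float]:
    """Run the toy pipeline: encode text, generate latents one frame at a
    time (each frame sees the earlier ones), then decode to samples."""
    rng = random.Random(seed)
    text_states = encode_text(text)
    latents: List[List[float]] = []
    for _ in range(num_frames):
        condition = backbone_step(text_states, latents)
        latents.append(diffusion_head(condition, rng))
    return decode_latents(latents)


if __name__ == "__main__":
    audio = synthesize("Guten Tag", num_frames=3)
    print(len(audio), "toy samples")
```

The point of the sketch is the control flow: latents are produced frame by frame, each conditioned on the text and on earlier frames, and only the final decoder produces waveform samples.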

Quick Start & Requirements

Install with uv (recommended) or pip; Python 3.10+ and PyTorch 2.0+ are required, and CUDA is recommended for GPU acceleration. Training demands substantial hardware (8x NVIDIA H100 GPUs for 5 days), while inference needs approximately 19GB of VRAM for the 7B model. Setup and inference can be started directly with uv run python start.py, and the README links to more comprehensive documentation.
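The hardware figures above can be put in context with some back-of-envelope arithmetic. The sketch below assumes 16-bit weights (2 bytes per parameter), which the summary does not state; under that assumption it shows how much of the quoted 19GB the weights alone would occupy, and converts the training setup into GPU-hours.

```python
# Back-of-envelope arithmetic from the figures quoted in this section.
# Assumption (not from the source): weights are stored in 16-bit precision.

PARAMS = 7e9             # "7B model"
BYTES_PER_PARAM = 2      # assumed fp16/bf16 storage
QUOTED_VRAM_GB = 19      # "approximately 19GB"

TRAIN_GPUS = 8           # "8x NVIDIA H100 GPUs"
TRAIN_DAYS = 5           # "for 5 days"


def weights_gb(params: float = PARAMS, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Memory taken by the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


def remaining_gb(quoted: float = QUOTED_VRAM_GB) -> float:
    """Part of the quoted VRAM not explained by the weights
    (activations, caches, decoder, framework overhead)."""
    return quoted - weights_gb()


def training_gpu_hours(gpus: int = TRAIN_GPUS, days: int = TRAIN_DAYS) -> int:
    """Total GPU-hours for the quoted training run."""
    return gpus * days * 24


if __name__ == "__main__":
    print(f"weights: {weights_gb():.0f} GB")
    print(f"other:   {remaining_gb():.0f} GB")
    print(f"training: {training_gpu_hours()} GPU-hours")
```

Under the 16-bit assumption the weights account for about 14GB, leaving roughly 5GB for everything else, and the training run comes to 960 H100 GPU-hours.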

Highlighted Details

  • State-of-the-Art Performance: Ranks highest in human preference evaluations (OpenSkill), surpassing ElevenLabs.
  • European Language Focus: Specifically trained for 24 major European languages, with strong representation for German, French, Spanish, and English.
  • Expressive Speech Synthesis: Supports diverse speaking styles including whispering, shouting, singing, and nuanced emotional tones.
  • Audio Watermarking: Integrates Facebook's AudioSeal for imperceptible AI-generated audio detection.
  • Pre-encoded Voices: Offers selection from high-quality, pre-encoded speaker voices.

Maintenance & Community

Led by Kajo Kratzenstein and Carlos Menke, the project is funded by the German Federal Ministry of Research. While explicit community channels are not detailed, HuggingFace is used for model hosting.

Licensing & Compatibility

Released under the permissive MIT License, KugelAudio allows broad commercial use and integration into closed-source applications, subject only to retaining the copyright and license notice.

Limitations & Caveats

Speech quality varies by language, and languages with less representation in the training data may show reduced performance. Voice cloning from raw audio is not supported; only pre-defined voices are available. Training requires significant computational resources.

Health Check

Last Commit: 3 months ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0
Star History: 14 stars in the last 30 days
