OuteTTS by edwko

A unified text-to-speech interface for models that treat audio as a language

created 9 months ago
1,342 stars

Top 30.5% on sourcepulse

View on GitHub
Project Summary

OuteTTS provides a unified interface for advanced Text-to-Speech models that treat audio as a language. It targets researchers and developers looking to integrate state-of-the-art TTS capabilities into their applications, offering flexible backend support and speaker cloning features.

How It Works

OuteTTS leverages a novel approach by treating audio generation as a sequence-to-sequence task, similar to natural language processing. It supports multiple backends, including llama.cpp and Hugging Face Transformers, allowing users to choose based on hardware and performance needs. The core advantage lies in its unified API, simplifying the integration of complex TTS models and enabling advanced features like speaker cloning and fine-grained sampling control.
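
A rough sketch of what the unified API looks like, assuming the configuration pattern described in interface_usage.md; the names used here (Interface, ModelConfig.auto_config, Models.VERSION_1_0_SIZE_1B, Backend.LLAMACPP, Backend.HF) follow that pattern and may differ between releases:

    import outetts

    # One Interface wraps every backend; only the config changes.
    config = outetts.ModelConfig.auto_config(
        model=outetts.Models.VERSION_1_0_SIZE_1B,  # assumed model identifier
        backend=outetts.Backend.LLAMACPP,          # or outetts.Backend.HF for Transformers
    )
    interface = outetts.Interface(config=config)

Everything downstream of the interface (speaker loading, generation, sampling settings) is intended to be backend-agnostic.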

Quick Start & Requirements

  • Install via pip: pip install outetts --upgrade
  • For CUDA (NVIDIA GPUs): CMAKE_ARGS="-DGGML_CUDA=on" pip install outetts --upgrade
  • For ROCm (AMD GPUs): CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install outetts --upgrade
  • For Vulkan: CMAKE_ARGS="-DGGML_VULKAN=on" pip install outetts --upgrade
  • For Metal (Apple Silicon): CMAKE_ARGS="-DGGML_METAL=on" pip install outetts --upgrade
  • Requires Python. GPU acceleration (CUDA, ROCm, Vulkan, Metal) is recommended for performance.
  • See: 🔗 interface_usage.md (a minimal generation sketch follows this list)
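
A minimal end-to-end sketch, assuming the Python interface documented in interface_usage.md; the speaker name "EN-FEMALE-1-NEUTRAL" and the quantization value are illustrative and may differ by version:

    import outetts

    interface = outetts.Interface(
        config=outetts.ModelConfig.auto_config(
            model=outetts.Models.VERSION_1_0_SIZE_1B,
            backend=outetts.Backend.LLAMACPP,
            quantization=outetts.LlamaCppQuantization.FP16,  # assumed quantization enum
        )
    )

    # Load a bundled speaker profile, synthesize speech, and write a WAV file
    speaker = interface.load_default_speaker("EN-FEMALE-1-NEUTRAL")
    output = interface.generate(
        config=outetts.GenerationConfig(
            text="Hello! This is a quick OuteTTS test.",
            speaker=speaker,
        )
    )
    output.save("output.wav")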

Highlighted Details

  • Supports multiple TTS backends: llama.cpp, Hugging Face Transformers, ExLlamaV2, and Transformers.js.
  • Features speaker cloning for voice replication, inheriting the reference speaker's emotion, style, and accent (see the cloning sketch after this list).
  • Recommends specific sampling configurations for optimal output quality, including windowed repetition penalties.
  • Maximum generation length is about 42 seconds (approx. 8,192 tokens); for best quality, keep generations under roughly 7,000 tokens.
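
Speaker cloning is driven by a reusable profile built from a short reference clip. A hedged sketch of that flow, assuming the create_speaker / save_speaker / load_speaker helpers referenced in the project's documentation (file paths are placeholders):

    import outetts

    interface = outetts.Interface(
        config=outetts.ModelConfig.auto_config(
            model=outetts.Models.VERSION_1_0_SIZE_1B,
            backend=outetts.Backend.LLAMACPP,
        )
    )

    # Build a speaker profile from a clean, clipping-free reference recording
    speaker = interface.create_speaker("reference.wav")

    # Persist the profile so the reference audio only needs processing once
    interface.save_speaker(speaker, "speaker.json")
    speaker = interface.load_speaker("speaker.json")

    # Generate with the cloned voice; emotion, style, and accent follow the reference
    output = interface.generate(
        config=outetts.GenerationConfig(
            text="This sentence should sound like the reference speaker.",
            speaker=speaker,
        )
    )
    output.save("cloned.wav")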

Maintenance & Community

  • Active development with community support via Discord and X (Twitter).
  • Website, Hugging Face, and Blog links provided for further information.

Licensing & Compatibility

  • The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • ExLlamaV2 backend requires manual installation.
  • The model may retain the accent of the reference speaker across different languages.
  • DAC audio reconstruction is lossy, and issues with speaker samples (clipping, loudness) can impact output quality.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 2
  • Issues (30d): 1

Star History

  • 145 stars in the last 90 days
