OuteTTS by edwko

A unified text-to-speech interface for models that treat audio as a language

created 9 months ago
1,342 stars

Top 30.5% on sourcepulse

View on GitHub
Project Summary

OuteTTS provides a unified interface for advanced Text-to-Speech models that treat audio as a language. It targets researchers and developers looking to integrate state-of-the-art TTS capabilities into their applications, offering flexible backend support and speaker cloning features.

How It Works

OuteTTS leverages a novel approach by treating audio generation as a sequence-to-sequence task, similar to natural language processing. It supports multiple backends, including llama.cpp and Hugging Face Transformers, allowing users to choose based on hardware and performance needs. The core advantage lies in its unified API, simplifying the integration of complex TTS models and enabling advanced features like speaker cloning and fine-grained sampling control.
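
A rough sketch of what the unified API looks like, assuming the configuration pattern described in interface_usage.md; the names used here (Interface, ModelConfig.auto_config, Models.VERSION_1_0_SIZE_1B, Backend.LLAMACPP, Backend.HF) follow that pattern and may differ between releases:

    import outetts

    # One Interface wraps every backend; only the config changes.
    config = outetts.ModelConfig.auto_config(
        model=outetts.Models.VERSION_1_0_SIZE_1B,  # assumed model identifier
        backend=outetts.Backend.LLAMACPP,          # or outetts.Backend.HF for Transformers
    )
    interface = outetts.Interface(config=config)

Everything downstream of the interface (speaker loading, generation, sampling settings) is intended to be backend-agnostic.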

Quick Start & Requirements

  • Install via pip: pip install outetts --upgrade
  • For CUDA (NVIDIA GPUs): CMAKE_ARGS="-DGGML_CUDA=on" pip install outetts --upgrade
  • For ROCm (AMD GPUs): CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install outetts --upgrade
  • For Vulkan: CMAKE_ARGS="-DGGML_VULKAN=on" pip install outetts --upgrade
  • For Metal (Apple Silicon): CMAKE_ARGS="-DGGML_METAL=on" pip install outetts --upgrade
  • Requires Python. GPU acceleration (CUDA, ROCm, Vulkan, Metal) is recommended for performance.
  • See: 🔗 interface_usage.md (a minimal generation sketch follows this list)
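
A minimal end-to-end sketch, assuming the Python interface documented in interface_usage.md; the speaker name "EN-FEMALE-1-NEUTRAL" and the quantization value are illustrative and may differ by version:

    import outetts

    interface = outetts.Interface(
        config=outetts.ModelConfig.auto_config(
            model=outetts.Models.VERSION_1_0_SIZE_1B,
            backend=outetts.Backend.LLAMACPP,
            quantization=outetts.LlamaCppQuantization.FP16,  # assumed quantization enum
        )
    )

    # Load a bundled speaker profile, synthesize speech, and write a WAV file
    speaker = interface.load_default_speaker("EN-FEMALE-1-NEUTRAL")
    output = interface.generate(
        config=outetts.GenerationConfig(
            text="Hello! This is a quick OuteTTS test.",
            speaker=speaker,
        )
    )
    output.save("output.wav")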

Highlighted Details

  • Supports multiple TTS backends: llama.cpp, Hugging Face Transformers, ExLlamaV2, and Transformers.js.
  • Features speaker cloning for voice replication, inheriting the reference speaker's emotion, style, and accent (see the cloning sketch after this list).
  • Recommends specific sampling configurations for optimal output quality, including windowed repetition penalties.
  • Maximum generation length is about 42 seconds (approx. 8,192 tokens); for best quality, keep generations under roughly 7,000 tokens.
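
Speaker cloning is driven by a reusable profile built from a short reference clip. A hedged sketch of that flow, assuming the create_speaker / save_speaker / load_speaker helpers referenced in the project's documentation (file paths are placeholders):

    import outetts

    interface = outetts.Interface(
        config=outetts.ModelConfig.auto_config(
            model=outetts.Models.VERSION_1_0_SIZE_1B,
            backend=outetts.Backend.LLAMACPP,
        )
    )

    # Build a speaker profile from a clean, clipping-free reference recording
    speaker = interface.create_speaker("reference.wav")

    # Persist the profile so the reference audio only needs processing once
    interface.save_speaker(speaker, "speaker.json")
    speaker = interface.load_speaker("speaker.json")

    # Generate with the cloned voice; emotion, style, and accent follow the reference
    output = interface.generate(
        config=outetts.GenerationConfig(
            text="This sentence should sound like the reference speaker.",
            speaker=speaker,
        )
    )
    output.save("cloned.wav")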

Maintenance & Community

  • Active development with community support via Discord and X (Twitter).
  • Website, Hugging Face, and Blog links provided for further information.

Licensing & Compatibility

  • The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • ExLlamaV2 backend requires manual installation.
  • The model may retain the accent of the reference speaker across different languages.
  • DAC audio reconstruction is lossy, and issues with speaker samples (clipping, loudness) can impact output quality.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 2
  • Issues (30d): 1

Star History

  • 145 stars in the last 90 days
