Speech-AI-Forge by lenML

TTS API server and Gradio WebUI

created 1 year ago
1,311 stars

Top 31.2% on sourcepulse

Project Summary

Speech-AI-Forge is a Text-to-Speech (TTS) toolkit that provides an API server and an interactive Gradio WebUI. It targets developers and researchers who want to integrate TTS into their applications, and offers multi-model support, voice cloning, and SSML for fine-grained control over speech synthesis.

How It Works

The project acts as a unified inference framework, abstracting the complexities of various TTS models including ChatTTS, CosyVoice, FishSpeech, and others. It supports both streaming and sentence-level synthesis, with an emphasis on flexible voice management, including custom voice uploads, reference audio cloning, and a dedicated "Voice Builder" for creating new voice models. An integrated Automatic Speech Recognition (ASR) component leverages Whisper for speech-to-text tasks.
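
To make the idea of sentence-level synthesis concrete, here is a minimal sketch of how text might be split into sentences and synthesized one chunk at a time, so that audio can be consumed before the whole input is processed. This is an illustration only, not the project's implementation: split_sentences, stream_synthesis, and fake_synthesize are hypothetical names, and the stand-in synthesizer just returns the sentence as bytes.

```python
import re
from typing import Callable, Iterator, List


def split_sentences(text: str) -> List[str]:
    """Split text after '.', '!' or '?' followed by whitespace (naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def stream_synthesis(text: str, synthesize: Callable[[str], bytes]) -> Iterator[bytes]:
    """Yield one audio chunk per sentence, in order, as soon as it is ready."""
    for sentence in split_sentences(text):
        yield synthesize(sentence)


def fake_synthesize(sentence: str) -> bytes:
    """Hypothetical stand-in for a real TTS model call."""
    return sentence.encode("utf-8")


for chunk in stream_synthesis("Hello there. How are you?", fake_synthesize):
    print(len(chunk), "bytes ready")
```

A generator keeps memory use flat and lets a caller start playback after the first sentence; a real backend would replace fake_synthesize with a model call.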

Quick Start & Requirements

  • Installation: Models are not downloaded automatically. Run python -m scripts.download_models --source huggingface before the first start.
  • Prerequisites: Python, PyTorch. Specific models may have additional dependencies. GPU acceleration is highly recommended for performance.
  • Running:
    • WebUI: python webui.py
    • API Server: python launch.py
  • Documentation: Installation and Running
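
The download-then-run sequence above can be scripted. The sketch below only wraps the three commands listed in this section; the helper names (build_commands, run), the repo_dir parameter, and the assumption that the commands are run from the repository root are illustrative and not part of the project.

```python
import subprocess
import sys

# Entry points as listed in the Quick Start section.
ENTRY_POINTS = {"webui": "webui.py", "api": "launch.py"}


def build_commands(mode="webui", source="huggingface"):
    """Return the model-download command followed by the start command."""
    if mode not in ENTRY_POINTS:
        raise ValueError(f"unknown mode: {mode!r}")
    download = [sys.executable, "-m", "scripts.download_models", "--source", source]
    start = [sys.executable, ENTRY_POINTS[mode]]
    return [download, start]


def run(mode="webui", source="huggingface", repo_dir="."):
    """Run both commands in order; stop if the download step fails."""
    for cmd in build_commands(mode, source):
        # Assumption: repo_dir is a local checkout of the repository.
        subprocess.run(cmd, cwd=repo_dir, check=True)


# Example (blocks while the server is running):
# run(mode="api", repo_dir="/path/to/checkout")
```

Using check=True means a failed model download aborts before the server is started, which avoids launching with missing weights.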

Highlighted Details

  • Supports multiple TTS models (ChatTTS, CosyVoice, FishSpeech, F5-TTS, etc.) and ASR (Whisper).
  • Features advanced voice cloning via reference audio and a "Voice Builder" for custom voice creation.
  • Includes SSML support for detailed control over speech synthesis, with a dedicated script editor.
  • Offers a voice enhancer and post-processing tools for optimizing audio output.
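
As an illustration of the kind of control SSML gives, the sketch below builds a small multi-voice script with Python's standard library. It uses generic W3C-style SSML elements (speak, voice, break); the exact tags and attributes accepted by Speech-AI-Forge may differ, so treat the element names, the voice names, and the build_ssml helper as assumptions and check the project's SSML documentation.

```python
import xml.etree.ElementTree as ET


def build_ssml(segments):
    """Build an SSML string from (voice_name, text, pause_ms) tuples."""
    speak = ET.Element("speak")
    for voice_name, text, pause_ms in segments:
        voice = ET.SubElement(speak, "voice", {"name": voice_name})
        voice.text = text
        if pause_ms:
            # Insert a pause after this segment.
            ET.SubElement(speak, "break", {"time": f"{pause_ms}ms"})
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml([
    ("narrator", "Hello there.", 300),   # hypothetical voice names
    ("guest", "Nice to meet you.", 0),
])
print(ssml)
```

Building the markup with an XML library rather than string concatenation ensures that text containing characters such as & or < is escaped correctly.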

Maintenance & Community

  • Active development with ongoing feature additions and model integrations.
  • Community support is available via a Discord server.

Licensing & Compatibility

  • The README does not explicitly state a project-wide license. The project appears to be permissively licensed, but confirm this in the repository and check each model's license for compatibility before use.

Limitations & Caveats

  • Model download is a manual process.
  • Some features, like the SenseVoice ASR and GPT-SoVITS TTS, are marked as "in development" (🚧).
  • The --compile flag is not recommended due to potential performance issues with dynamic shapes.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull requests (30d): 3
  • Issues (30d): 9
  • Star history: 117 stars in the last 90 days