Dia-TTS-Server  by devnen

Self-host a powerful TTS model with an OpenAI-compatible API

Created 4 months ago
323 stars

Top 84.1% on SourcePulse

Project Summary

This project provides a self-hostable server for the Dia TTS model, with a web UI and an OpenAI-compatible API for integration with existing clients. It targets developers and power users who need advanced text-to-speech capabilities, including voice cloning and realistic dialogue generation. Its default BF16 SafeTensors weights reduce VRAM usage and speed up inference.

How It Works

The server uses the FastAPI framework to expose the Dia TTS model. Long text inputs are split into chunks, processed sequentially, and the resulting audio is concatenated, which makes large documents practical to synthesize. The project defaults to BF16 SafeTensors weights for lower VRAM usage and faster inference, and still supports the original .pth weights. It detects NVIDIA CUDA automatically and uses it for GPU acceleration, falling back to CPU when no GPU is available.
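
The chunk-then-concatenate idea described above can be sketched as follows. This is a minimal illustration, not the project's actual splitter: the function name chunk_text, the max_chars parameter, and the sentence-boundary regex are all assumptions made for this example.

```python
import re


def chunk_text(text: str, max_chars: int = 120) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    Hypothetical helper for illustration; the server's real chunking
    rules may differ. A single sentence longer than max_chars is kept
    whole rather than split mid-sentence.
    """
    # Split after ., ! or ? followed by whitespace; drop empty pieces.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be synthesized in order and the audio segments joined. Splitting on sentence boundaries rather than at a fixed character offset avoids cutting a word or phrase in half.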

Quick Start & Requirements

  • Installation: Clone the repository, set up a Python virtual environment, and install dependencies with pip install -r requirements.txt. For GPU acceleration, install a PyTorch build with CUDA support.
  • Prerequisites: Python 3.10+, Git, NVIDIA GPU (recommended for performance), CUDA Toolkit (if using GPU), libsndfile1 and ffmpeg (on Linux).
  • Docker: Pre-built images are available on GHCR. Running docker compose up -d provides a one-command setup.
  • Resources: Initial model downloads can be substantial (3-7GB). VRAM usage is approximately 7GB with BF16 SafeTensors.
  • Docs: https://github.com/devnen/dia-tts-server

Highlighted Details

  • OpenAI-compatible API endpoint (/v1/audio/speech).
  • Supports 43 built-in voices and improved voice cloning with automatic audio/transcript handling.
  • Intelligent large text chunking with configurable size and UI toggle.
  • Generation seed for reproducible results across chunks or requests.
  • Automatic audio post-processing for silence trimming and artifact removal.
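
A request to the OpenAI-compatible endpoint listed above might be built like this. The path /v1/audio/speech comes from the project description, and the payload field names follow OpenAI's speech API. The base URL, the model value, the voice name, and whether the server accepts every field are assumptions to check against the project's documentation.

```python
import json
import urllib.request

# Assumption: host and port are placeholders; use the address your server prints.
BASE_URL = "http://localhost:8003"


def build_speech_request(text: str, voice: str = "example-voice",
                         base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/audio/speech."""
    payload = {
        "model": "dia",            # assumption: placeholder model name
        "input": text,             # text to synthesize
        "voice": voice,            # assumption: placeholder voice name
        "response_format": "wav",  # field name from OpenAI's speech API
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def save_speech(request: urllib.request.Request, out_path: str) -> None:
    """Send the request and write the returned audio bytes to out_path."""
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as f:
        f.write(response.read())
```

With a server running, save_speech(build_speech_request("Hello there."), "speech.wav") would write the audio to disk. Separating request construction from sending keeps the payload easy to inspect before any network call is made.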

Maintenance & Community

The project is actively maintained by devnen. Contributions are welcome via issues and pull requests.

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

Whisper integration for transcript generation during voice cloning is experimental. Voice consistency across chunks in "Random/Dialogue" mode without a fixed seed may vary. The "UI Cancel" button stops frontend waiting but does not immediately halt backend inference.
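
The effect of a fixed seed on chunk-to-chunk consistency can be shown with a toy model. This sketch uses Python's random module as a stand-in for the model's sampler; generate_chunk and generate_all are hypothetical names, and the real server may apply its seed differently.

```python
import random
from typing import Optional


def generate_chunk(chunk: str, rng: random.Random) -> list[float]:
    """Hypothetical stand-in for model sampling: one random value per word."""
    return [rng.random() for _ in chunk.split()]


def generate_all(chunks: list[str], seed: Optional[int] = None) -> list[list[float]]:
    """Re-seed before every chunk so each chunk starts from the same state.

    With seed=None, each chunk gets fresh entropy, which mirrors the
    caveat that results may vary from chunk to chunk without a fixed seed.
    """
    outputs = []
    for chunk in chunks:
        rng = random.Random(seed)  # same seed gives the same starting state
        outputs.append(generate_chunk(chunk, rng))
    return outputs
```

With a fixed seed, every chunk begins from an identical sampler state and repeated runs produce identical output. Without one, both guarantees are lost.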

Health Check

  • Last commit: 3 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 19 stars in the last 30 days
