dots.tts  by rednote-hilab

Continuous autoregressive TTS with LLM backbone

Created 1 week ago

473 stars

Top 63.9% on SourcePulse

Summary

dots.tts is a 2-billion-parameter, fully continuous, end-to-end autoregressive text-to-speech (TTS) system for high-fidelity voice synthesis and cloning. It targets natural, expressive, and stable speech by eliminating discrete audio tokens and combining three components: a semantic encoder, a large language model (LLM), and a diffusion transformer (DiT) acoustic head. The project reports state-of-the-art results on several benchmarks, including Seed-TTS-Eval and the MiniMax multilingual benchmark, and is aimed at researchers and developers working on speech synthesis, voice cloning, and multilingual applications.

How It Works

The pipeline has three stages. A frozen AudioVAE encodes 48 kHz audio into continuous latents. A semantic encoder and an LLM (initialized from Qwen2.5-1.5B-Base) process those latents, with the LLM consuming BPE text directly. An autoregressive flow-matching head, implemented as a DiT conditioned on the LLM output and a speaker x-vector, then denoises the latent representation patch by patch. Because the audio side stays continuous and token-free, the system avoids the quantization artifacts common in systems that rely on discrete audio tokens. Two modes are supported: a "plain" mode for standard TTS and an "interleaved" 1T1A mode for low-latency streaming.
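The control flow described above (a backbone produces a conditioning vector, then a flow-matching head integrates noise into one latent patch, and the patch is fed back as history) can be illustrated with a toy sketch. Everything here is hypothetical: `toy_backbone`, `toy_velocity`, `sample_patch`, and `generate` are stand-ins invented for illustration, the velocity field is a closed-form straight-line flow instead of a trained DiT, and none of the names, shapes, or step counts come from the dots.tts code.

```python
import math
import random


def toy_backbone(text_ids, history, speaker):
    """Hypothetical stand-in for the LLM: mixes the current text id, the mean
    of the previous patch, and the speaker vector into a conditioning vector."""
    dim = len(speaker)
    token = text_ids[min(len(history), len(text_ids) - 1)]
    if history:
        last = history[-1]  # previous patch: a list of frames
        prev = [sum(frame[d] for frame in last) / len(last) for d in range(dim)]
    else:
        prev = [0.0] * dim
    return [math.tanh(0.1 * token + 0.5 * prev[d] + speaker[d]) for d in range(dim)]


def toy_velocity(x, t, cond):
    """Hypothetical stand-in for the DiT head. For a straight-line path from
    noise to a target, the velocity at time t is (target - x) / (1 - t)."""
    scale = 1.0 / max(1.0 - t, 1e-3)
    return [(c - xi) * scale for xi, c in zip(x, cond)]


def sample_patch(cond, patch_len, n_steps, rng):
    """Integrate the flow from Gaussian noise to a latent patch with Euler steps."""
    patch = []
    for _ in range(patch_len):
        x = [rng.gauss(0.0, 1.0) for _ in cond]
        for k in range(n_steps):
            t = k / n_steps
            v = toy_velocity(x, t, cond)
            x = [xi + vi / n_steps for xi, vi in zip(x, v)]
        patch.append(x)
    return patch


def generate(text_ids, speaker, n_patches=4, patch_len=2, n_steps=8, seed=0):
    """Autoregressive loop: each new patch is conditioned on the patches so far."""
    rng = random.Random(seed)
    history = []
    for _ in range(n_patches):
        cond = toy_backbone(text_ids, history, speaker)
        history.append(sample_patch(cond, patch_len, n_steps, rng))
    # Flatten patches into one sequence of continuous latent frames.
    return [frame for patch in history for frame in patch]


latents = generate([3, 1, 4], speaker=[0.0, 0.0, 0.0, 0.0])
print(len(latents), "frames of dimension", len(latents[0]))
```

In a real system the final latents would be decoded to a waveform by the AudioVAE decoder; that step is omitted here because no discrete codebook lookup is involved at any point, which is the property the sketch is meant to show.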

Quick Start & Requirements

Installation requires a conda environment with Python 3.10–3.12. Install with pip from the repository root:

    python -m pip install -e . -c constraints/recommended.txt

For training extras (the extras specifier is quoted so that shells such as zsh do not expand the brackets):

    python -m pip install -e ".[full]" -c constraints/recommended.txt

The package provides a CLI for inference, a Python API (DotsTtsRuntime), and a Gradio web demo. Pretrained checkpoints are available.
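Before installing, it can help to confirm that the active interpreter falls inside the supported range. The following is a minimal sketch using only the standard library; the helper name `python_is_supported` and the `SUPPORTED` constant are illustrative and not part of the dots.tts package, and the range is taken from this summary, not from the project's packaging metadata.

```python
import sys

# Inclusive (major, minor) bounds quoted in this summary: Python 3.10 to 3.12.
SUPPORTED = ((3, 10), (3, 12))


def python_is_supported(version=None):
    """Return True if `version` (default: the running interpreter) is in range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return SUPPORTED[0] <= major_minor <= SUPPORTED[1]


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "supported" if python_is_supported() else "outside the supported range"
    print(f"Python {current} is {status}")
```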

Highlighted Details

  • Achieves state-of-the-art results on Seed-TTS-Eval (e.g., 0.94% WER, 81.0 SIM on zh-hard).
  • Leads the MiniMax multilingual benchmark with an average speaker similarity (SIM) of 83.9 across 24 languages.
  • Demonstrates strong performance on CV3-Eval and EmergentTTS-Eval, with competitive expressiveness and syntactic complexity scores.
  • Supports zero-shot voice cloning in two modes (continuation and x-vector-only) and emphasizes generation stability.
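For readers unfamiliar with the two metrics quoted above, the sketch below shows their usual definitions: WER as word-level edit distance divided by reference length, and SIM as cosine similarity between speaker embeddings. This is an assumption about how such scores are typically computed, not a description of the benchmark tooling. The ASR model, text normalization, and speaker-embedding model used by each benchmark are not stated in this summary, Chinese test sets commonly count characters instead of whitespace-separated words, and a SIM of 83.9 presumably corresponds to a cosine similarity of 0.839 scaled by 100.

```python
import math


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / max(len(ref), 1)


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# One substituted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"WER: {100 * wer:.2f}%")

# Made-up 3-dimensional "embeddings" purely for illustration.
sim = cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.0, 0.5])
print(f"SIM: {100 * sim:.1f}")
```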

Maintenance & Community

The project is open-source under Apache-2.0. Community contributions include third-party ports for Apple Silicon (MLX, Swift) and integration with ComfyUI for TTS and voice cloning workflows. No specific maintainer details or sponsorships are listed.

Licensing & Compatibility

The code and released checkpoints are licensed under Apache-2.0, permitting commercial use and integration into closed-source projects.

Limitations & Caveats

High-fidelity voice cloning carries a risk of misuse for impersonation or fraud; users must implement consent-aware policies and content marking. WER is worse for low-resource languages, which is attributed to the LLM backbone's need for large amounts of training data, though speaker similarity is maintained. The system is trained primarily on speech and does not cover singing or general sound generation.

Health Check

  • Last commit: 13 hours ago
  • Responsiveness: Inactive
  • Pull requests (30d): 2
  • Issues (30d): 12
  • Star history: 473 stars in the last 8 days
