dots.tts by studio-dots-ai

Continuous autoregressive TTS with LLM backbone

Created 1 month ago

950 stars

Top 37.9% on SourcePulse

Summary

dots.tts is a 2-billion-parameter, fully continuous, end-to-end autoregressive text-to-speech system for high-fidelity voice synthesis and cloning. It aims for natural, expressive, and stable speech by eliminating discrete audio tokens and combining three components: a semantic encoder, a large language model (LLM), and a diffusion transformer (DiT) acoustic head. The project reports state-of-the-art results on several benchmarks and targets researchers and developers working on speech synthesis, voice cloning, and multilingual applications.

How It Works

The pipeline works as follows. A frozen AudioVAE encodes 48 kHz audio into continuous latents, which are processed by a semantic encoder and an LLM (initialized from Qwen2.5-1.5B-Base) that consumes BPE text directly. An autoregressive flow-matching head, implemented as a DiT conditioned on the LLM output and a speaker x-vector, then denoises the latent representation patch by patch. Because the model stays continuous and token-free end to end, it avoids the quantization artifacts that discrete-token systems can introduce. Two modes are supported: a "plain" mode for standard TTS and an "interleaved" 1T1A mode for low-latency streaming applications.
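The patch-by-patch denoising described above can be illustrated with a toy sampler. This is a minimal sketch, not the project's code: `toy_velocity`, `sample_patch`, and `generate` are hypothetical names, the learned DiT is replaced by the exact velocity of a straight-line path, and the LLM and x-vector conditioning is replaced by a simple function of the previous patch. It only shows the control flow of autoregressive flow matching: each patch starts from Gaussian noise and is integrated from t = 0 to t = 1 with Euler steps, conditioned on what was generated before.

```python
import numpy as np


def toy_velocity(x, t, target):
    # Stand-in for the learned DiT head (assumption): the exact velocity
    # field of a straight-line path that reaches `target` at t = 1.
    return (target - x) / (1.0 - t)


def sample_patch(target, num_steps, rng):
    # Start from Gaussian noise and integrate dx/dt = v(x, t) with Euler steps.
    x = rng.standard_normal(target.shape)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the division is safe
        x = x + dt * toy_velocity(x, t, target)
    return x


def generate(num_patches, patch_dim, num_steps=8, seed=0):
    rng = np.random.default_rng(seed)
    patches = []
    for i in range(num_patches):
        # Stand-in for the conditioning signal (assumption): a real system
        # would use the LLM hidden state and a speaker x-vector here.
        prev = patches[-1] if patches else np.zeros(patch_dim)
        target = 0.5 * prev + np.full(patch_dim, float(i + 1))
        patches.append(sample_patch(target, num_steps, rng))
    return np.stack(patches)


if __name__ == "__main__":
    latents = generate(num_patches=3, patch_dim=4)
    print(latents.shape)
```

With the oracle velocity used here, the final Euler step lands on the conditioning target, which makes the sketch easy to check. A learned velocity network would only approximate that path, and the number of integration steps becomes a quality and latency trade-off.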

Quick Start & Requirements

Installation requires a conda environment with Python 3.10–3.12. Install with pip:

    python -m pip install -e . -c constraints/recommended.txt

For training extras, use:

    python -m pip install -e .[full] -c constraints/recommended.txt

The package provides a CLI for inference, a Python API (DotsTtsRuntime), and a Gradio web demo. Pretrained checkpoints are available.
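Before installing, it can help to confirm that the active interpreter falls inside the stated 3.10–3.12 range. The following is a small sketch using only the standard library; `is_supported_python` is a hypothetical helper written for this page, not part of the dots.tts package.

```python
import sys

# Supported range taken from the requirement stated above.
MIN_VERSION = (3, 10)
MAX_VERSION = (3, 12)


def is_supported_python(version_info=None):
    """Return True if the (major, minor) version lies in the supported range."""
    if version_info is None:
        version_info = sys.version_info
    major_minor = tuple(version_info[:2])
    return MIN_VERSION <= major_minor <= MAX_VERSION


if __name__ == "__main__":
    status = "supported" if is_supported_python() else "outside the documented range"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is {status}")
```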

Highlighted Details

Achieves state-of-the-art results on Seed-TTS-Eval (e.g., 0.94% WER, 81.0 SIM on zh-hard).
Leads the MiniMax multilingual benchmark with an average speaker similarity (SIM) of 83.9 across 24 languages.
Demonstrates strong performance on CV3-Eval and EmergentTTS-Eval, with competitive expressiveness and syntactic complexity scores.
Offers robust zero-shot voice cloning (continuation and x-vector-only) and generation stability.
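
For readers unfamiliar with the two metrics quoted above, the sketch below shows how they are commonly defined. It is an illustration under assumptions, not the benchmarks' evaluation code: word error rate (WER) is taken as word-level edit distance divided by reference length, and speaker similarity (SIM) as cosine similarity between two speaker embeddings. The function names are hypothetical, and the real evaluation suites also involve an ASR model, text normalization, and a speaker-embedding model that are not reproduced here.

```python
import math


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
```

A lower WER means the synthesized speech is transcribed more faithfully, while a higher SIM means the generated voice is closer to the reference speaker.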

Maintenance & Community

The project is open-source under Apache-2.0. Community contributions include third-party ports for Apple Silicon (MLX, Swift) and integration with ComfyUI for TTS and voice cloning workflows. No specific maintainer details or sponsorships are listed.

Licensing & Compatibility

The code and released checkpoints are licensed under Apache-2.0, permitting commercial use and integration into closed-source projects.

Limitations & Caveats

High-fidelity voice cloning carries a risk of misuse for impersonation or fraud; users must implement consent-aware policies and content marking. Low-resource languages show a WER gap, attributed to the LLM's need for large amounts of training data, although speaker similarity is maintained. The system is trained primarily on speech and does not cover singing or general sound generation.
