rednote-hilab
Continuous autoregressive TTS with LLM backbone
Top 63.9% on SourcePulse
Summary
dots.tts is a 2-billion-parameter, fully continuous, end-to-end autoregressive text-to-speech system for high-fidelity voice synthesis and cloning. It targets natural, expressive, and stable TTS by eliminating discrete audio tokens and combining a semantic encoder, a large language model (LLM), and a diffusion transformer (DiT) acoustic head. The project reports state-of-the-art results across multiple benchmarks and is aimed at researchers and developers working on speech synthesis, voice cloning, and multilingual applications.
How It Works
A frozen AudioVAE encodes 48 kHz audio into continuous latents. These latents are processed by a semantic encoder and an LLM (initialized from Qwen2.5-1.5B-Base) that consumes BPE text directly. An autoregressive flow-matching head, implemented as a DiT conditioned on the LLM output and a speaker x-vector, then denoises the latent representation patch by patch. Because the audio is never quantized into discrete tokens, this continuous autoregressive approach avoids the quantization artifacts common in token-based systems. Two modes are supported: a "plain" mode for standard TTS and an "interleaved" 1T1A mode for low-latency streaming applications.
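The patch-by-patch denoising loop can be illustrated with a toy sketch. This is not the dots.tts implementation: `toy_condition` stands in for the LLM hidden state, `toy_velocity` stands in for the learned DiT head, and combining the context with the speaker vector by addition is purely an assumption made for illustration. What the sketch does show is the general shape of autoregressive flow-matching sampling: each patch starts from Gaussian noise and is integrated along a velocity field with Euler steps, and each finished patch feeds the conditioning for the next one.

```python
import numpy as np

def toy_condition(history, speaker_vec):
    """Hypothetical stand-in for the LLM state: summarises the last generated patch."""
    if history:
        context = np.mean(history[-1], axis=0)
    else:
        context = np.zeros_like(speaker_vec)
    # Assumption: context and speaker vector are simply added, then squashed.
    return np.tanh(context + speaker_vec)

def toy_velocity(x, t, cond):
    """Hypothetical stand-in for the DiT head: velocity pointing from x toward cond."""
    target = np.broadcast_to(cond, x.shape)
    return (target - x) / (1.0 - t)

def sample_patch(cond, patch_len, rng, num_steps=8):
    """Euler integration from noise (t=0) toward data (t=1) for one latent patch."""
    x = rng.standard_normal((patch_len, cond.shape[0]))
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the division above is safe
        x = x + dt * toy_velocity(x, t, cond)
    return x

def generate(num_patches, patch_len, speaker_vec, seed=0):
    """Autoregressive loop: each new patch is conditioned on the patches before it."""
    rng = np.random.default_rng(seed)
    history = []
    for _ in range(num_patches):
        cond = toy_condition(history, speaker_vec)
        history.append(sample_patch(cond, patch_len, rng))
    return np.stack(history)

speaker = np.array([0.5, -0.25, 0.1, 0.0])  # toy 4-dimensional "x-vector"
latents = generate(num_patches=3, patch_len=2, speaker_vec=speaker)
print(latents.shape)
```

In the real system the resulting continuous latents would be decoded back to audio by the AudioVAE; the toy stops at the latent sequence.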
Quick Start & Requirements
Installation requires a conda environment with Python 3.10–3.12. Install via pip:
python -m pip install -e . -c constraints/recommended.txt
For training extras, use:
python -m pip install -e .[full] -c constraints/recommended.txt
The package provides a CLI for inference, a Python API (DotsTtsRuntime), and a Gradio web demo. Pretrained checkpoints are available.
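Before installing, it may help to confirm that the active interpreter falls inside the stated Python 3.10–3.12 range. The following is a minimal sketch of such a check; the helper name `python_version_supported` is hypothetical and not part of the dots.tts package.

```python
import sys

# Supported range as stated in the requirements above.
SUPPORTED_MIN = (3, 10)
SUPPORTED_MAX = (3, 12)

def python_version_supported(version=None):
    """Return True if the (major, minor) version lies within the supported range."""
    if version is None:
        version = sys.version_info
    major_minor = tuple(version[:2])
    return SUPPORTED_MIN <= major_minor <= SUPPORTED_MAX

current = "{}.{}".format(*sys.version_info[:2])
if python_version_supported():
    print(f"Python {current} is within the supported range.")
else:
    print(f"Python {current} is outside the supported 3.10-3.12 range.")
```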
Highlighted Details
Maintenance & Community
The project is open-source under Apache-2.0. Community contributions include third-party ports for Apple Silicon (MLX, Swift) and integration with ComfyUI for TTS and voice cloning workflows. No specific maintainer details or sponsorships are listed.
Licensing & Compatibility
The code and released checkpoints are licensed under Apache-2.0, permitting commercial use and integration into closed-source projects.
Limitations & Caveats
High-fidelity voice cloning carries a risk of misuse for impersonation or fraud; users must implement consent-aware policies and content marking. A word error rate (WER) gap remains for low-resource languages, attributed to the LLM's need for large amounts of training data, though speaker similarity is maintained. The system is trained primarily on speech and does not cover singing or general sound generation.