dia by nari-labs

TTS model for ultra-realistic dialogue generation

Created 5 months ago
18,396 stars

Top 2.4% on SourcePulse

Project Summary

Dia is a 1.6B parameter text-to-speech model designed for generating ultra-realistic, one-pass dialogue with fine-grained control over emotion and non-verbal cues. It targets researchers and developers seeking advanced TTS capabilities for English content, offering features like voice cloning and conditioning on audio prompts.

How It Works

Dia directly generates speech from text, allowing for conditioning on audio inputs to control tone and emotion. It supports specific non-verbal tags (e.g., (laughs), (coughs)) and uses speaker tags [S1] and [S2] to manage dialogue flow. The model leverages the Descript Audio Codec and is built for efficient inference on GPUs.
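
As an illustration of the tagging convention described above, the sketch below assembles a transcript from (speaker, line) pairs. The helper name build_script and its validation rules are assumptions made for this example and are not part of Dia's API; only the [S1]/[S2] speaker tags and parenthesized non-verbal cues such as (laughs) come from the description above.

```python
from typing import Iterable, Tuple

# Speaker tags named in the project description.
SPEAKER_TAGS = {1: "[S1]", 2: "[S2]"}


def build_script(turns: Iterable[Tuple[int, str]]) -> str:
    """Join (speaker, text) turns into one tagged transcript string.

    Hypothetical helper for illustration; not part of the dia package.
    """
    parts = []
    for speaker, text in turns:
        if speaker not in SPEAKER_TAGS:
            raise ValueError(f"unknown speaker {speaker!r}; expected 1 or 2")
        text = text.strip()
        if not text:
            raise ValueError("empty dialogue turn")
        parts.append(f"{SPEAKER_TAGS[speaker]} {text}")
    return " ".join(parts)


script = build_script([
    (1, "Did you hear the new demo?"),
    (2, "I did. (laughs) It sounded surprisingly natural."),
])
print(script)
```

The resulting string is the kind of input the model is described as consuming; how it is passed to the model depends on the installed version of the package.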

Quick Start & Requirements

  • Install via pip: pip install git+https://github.com/nari-labs/dia.git
  • Run Gradio UI: clone the repo and cd dia, then either run uv run app.py, or set up a virtual environment manually with python -m venv .venv && source .venv/bin/activate && pip install -e . && python app.py
  • Requirements: PyTorch 2.0+, CUDA 12.6+. Initial run requires downloading Descript Audio Codec.
  • Demo: hosted on a Hugging Face ZeroGPU Space.

Highlighted Details

  • Generates realistic dialogue with controllable emotion and non-verbal sounds.
  • Supports voice cloning via audio prompts.
  • Achieves ~2.2x real-time factor on RTX 4090 with float16 precision and torch.compile.
  • Requires ~10GB VRAM for float16 or bfloat16 inference.
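
The figures above can be put in rough perspective with some arithmetic. The sketch below assumes 2 bytes per parameter for float16 and bfloat16, and reads the ~2.2x figure as seconds of audio produced per second of wall-clock time. Both readings are assumptions, and nothing here is a measurement; the gap between the weight size and the quoted ~10GB would have to come from other runtime memory such as activations and the audio codec.

```python
PARAMS = 1.6e9           # parameter count from the project summary
BYTES_PER_PARAM = 2      # float16 and bfloat16 both use 2 bytes per value
REALTIME_FACTOR = 2.2    # assumed: audio seconds per wall-clock second


def weight_memory_gb(params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


def generation_time_s(audio_seconds: float, rtf: float) -> float:
    """Wall-clock time to produce a clip at a given real-time factor."""
    return audio_seconds / rtf


weights_gb = weight_memory_gb(PARAMS, BYTES_PER_PARAM)
clip_time = generation_time_s(20, REALTIME_FACTOR)
print(f"weights alone: {weights_gb:.1f} GB")
print(f"20 s clip: about {clip_time:.1f} s to generate")
```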

Maintenance & Community

  • Active development by a small team.
  • Community support via Discord Server.

Licensing & Compatibility

  • Licensed under Apache License 2.0.
  • Permissive for commercial use and closed-source linking.

Limitations & Caveats

The model currently supports English only. Input length should be moderate: text corresponding to less than about 5 seconds or more than about 20 seconds of audio tends to produce unnatural speech. Overusing or misusing non-verbal tags may cause artifacts. Voices vary between runs by default, so speaker consistency requires an audio prompt or a fixed seed. CPU support is planned but not yet available.
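
One way to act on the length guidance is to estimate spoken duration before synthesis. The sketch below is a heuristic only: the 150 words-per-minute speaking rate, the function names, and the choice to strip tags before counting words are all assumptions for this example. The 5 s and 20 s bounds are taken from the caveat above.

```python
import re

WORDS_PER_MINUTE = 150                  # assumed average speaking rate
MIN_SECONDS, MAX_SECONDS = 5.0, 20.0    # bounds from the caveat above

# Matches speaker tags like [S1] and non-verbal cues like (laughs).
TAG_PATTERN = re.compile(r"\[S\d+\]|\([^)]*\)")


def estimate_seconds(script: str) -> float:
    """Rough spoken duration of a tagged script, ignoring the tags."""
    spoken = TAG_PATTERN.sub(" ", script)
    return len(spoken.split()) * 60.0 / WORDS_PER_MINUTE


def length_status(script: str) -> str:
    """Classify a script as 'too short', 'ok', or 'too long'."""
    seconds = estimate_seconds(script)
    if seconds < MIN_SECONDS:
        return "too short"
    if seconds > MAX_SECONDS:
        return "too long"
    return "ok"


print(length_status("[S1] Hello there. [S2] Hi. (laughs)"))
```

A script flagged as too long could be split at speaker turns and synthesized in pieces, although keeping the voice consistent across pieces would then depend on an audio prompt or a fixed seed.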

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 1
  • Issues (30d): 5
  • Star history: 386 stars in the last 30 days
