TTS model for ultra-realistic dialogue generation
Top 2.5% on sourcepulse
Dia is a 1.6B parameter text-to-speech model designed for generating ultra-realistic, one-pass dialogue with fine-grained control over emotion and non-verbal cues. It targets researchers and developers seeking advanced TTS capabilities for English content, offering features like voice cloning and conditioning on audio prompts.
How It Works
Dia generates speech directly from text and can be conditioned on audio inputs to control tone and emotion. It supports non-verbal tags (e.g., `(laughs)`, `(coughs)`) and uses the speaker tags `[S1]` and `[S2]` to manage dialogue flow. The model leverages the Descript Audio Codec and is built for efficient GPU inference.
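A minimal generation sketch, assuming the Python API shown in the project README (`Dia.from_pretrained` and `generate`); the `nari-labs/Dia-1.6B` checkpoint id and the 44.1 kHz output rate are assumptions here:

```python
import soundfile as sf

from dia.model import Dia  # import path as shown in the project README

# Assumed checkpoint id on Hugging Face; adjust if the hosted weights differ.
model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# [S1]/[S2] alternate speakers; (laughs) renders as a non-verbal cue.
script = (
    "[S1] Did you see the release notes? "
    "[S2] Not yet, anything good? "
    "[S1] One-pass dialogue generation. (laughs) It actually works."
)

audio = model.generate(script)

# Assumed output: a mono waveform at 44.1 kHz.
sf.write("dialogue.wav", audio, 44100)
```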
Quick Start & Requirements
Install with `pip install git+https://github.com/nari-labs/dia.git`. To run the demo app, clone the repository and `cd dia`, then `uv run app.py`, or `python -m venv .venv && source .venv/bin/activate && pip install -e . && python app.py`.
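Dia currently requires a GPU (see Limitations & Caveats below), so a quick environment check before launching the app can save a failed run; this sketch uses only standard PyTorch calls:

```python
import torch

# Dia currently targets GPU inference; verify a CUDA device is visible
# before launching the app.
if not torch.cuda.is_available():
    raise SystemExit("No CUDA device found; Dia does not yet support CPU inference.")

props = torch.cuda.get_device_properties(0)
print(f"Using {props.name} with {props.total_memory / 1024**3:.1f} GiB of VRAM")
```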
Highlighted Details
Supports `float16` and `bfloat16` precision, with `torch.compile` available for faster
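The snippet below sketches the generic PyTorch recipe for reduced-precision inference; whether Dia's loader already applies a dtype internally is not specified here, so the autocast wrapper and the `generate` call are assumptions carried over from the example above:

```python
import torch

from dia.model import Dia  # assumed import path, as in the example above

model = Dia.from_pretrained("nari-labs/Dia-1.6B")  # assumed checkpoint id

# Standard PyTorch pattern: autocast runs CUDA ops in reduced precision.
# Prefer bfloat16 where the GPU supports it, otherwise fall back to float16.
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

# torch.compile(model) may further speed up repeated generation, assuming
# the underlying object is a compilable nn.Module.
with torch.inference_mode(), torch.autocast("cuda", dtype=dtype):
    audio = model.generate("[S1] Reduced-precision test. [S2] Sounds fine to me.")
```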
inference.
Maintenance & Community
Last updated 3 weeks ago; activity is listed as inactive.
Licensing & Compatibility
Limitations & Caveats
The model currently supports English only. Input text should correspond to roughly 5 to 20 seconds of audio; much shorter or longer scripts tend to sound unnatural. Overusing or misusing non-verbal tags can produce artifacts. Voices vary between runs by default, so speaker consistency requires an audio prompt or a fixed seed (see the sketch below). CPU support is planned but not yet available.
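For reproducible voices without an audio prompt, the usual workaround is to fix the global seeds before generating; this sketch uses only standard-library, NumPy, and PyTorch seeding calls (the commented `generate` call is the assumed API from above):

```python
import random

import numpy as np
import torch

def fix_seed(seed: int = 42) -> None:
    """Seed the common RNGs so repeated runs sample the same voice."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

fix_seed(42)
# audio = model.generate(script)  # same seed -> same sampled speaker identity
```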