TTS model for ultra-realistic dialogue generation
Top 2.5% on sourcepulse
Dia is a 1.6B parameter text-to-speech model designed for generating ultra-realistic, one-pass dialogue with fine-grained control over emotion and non-verbal cues. It targets researchers and developers seeking advanced TTS capabilities for English content, offering features like voice cloning and conditioning on audio prompts.
How It Works
Dia generates speech directly from text and can be conditioned on audio inputs to control tone and emotion. It supports non-verbal tags (e.g., `(laughs)`, `(coughs)`) and uses the speaker tags `[S1]` and `[S2]` to manage dialogue flow. The model leverages the Descript Audio Codec and is built for efficient GPU inference.
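A minimal generation sketch, assuming the Python API shown in the project README (`Dia.from_pretrained` and `generate`); the `nari-labs/Dia-1.6B` checkpoint id and the 44.1 kHz output rate are assumptions here:

```python
import soundfile as sf

from dia.model import Dia  # import path as shown in the project README

# Assumed checkpoint id on Hugging Face; adjust if the hosted weights differ.
model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# [S1]/[S2] alternate speakers; (laughs) renders as a non-verbal cue.
script = (
    "[S1] Did you see the release notes? "
    "[S2] Not yet, anything good? "
    "[S1] One-pass dialogue generation. (laughs) It actually works."
)

audio = model.generate(script)

# Assumed output: a mono waveform at 44.1 kHz.
sf.write("dialogue.wav", audio, 44100)
```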
Quick Start & Requirements
Install with `pip install git+https://github.com/nari-labs/dia.git`. To run the demo app, clone the repository and `cd dia`, then `uv run app.py`, or `python -m venv .venv && source .venv/bin/activate && pip install -e . && python app.py`.
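Dia currently requires a GPU (see Limitations & Caveats below), so a quick environment check before launching the app can save a failed run; this sketch uses only standard PyTorch calls:

```python
import torch

# Dia currently targets GPU inference; verify a CUDA device is visible
# before launching the app.
if not torch.cuda.is_available():
    raise SystemExit("No CUDA device found; Dia does not yet support CPU inference.")

props = torch.cuda.get_device_properties(0)
print(f"Using {props.name} with {props.total_memory / 1024**3:.1f} GiB of VRAM")
```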
Highlighted Details
Supports `float16` and `bfloat16` precision, with `torch.compile` available for faster
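The snippet below sketches the generic PyTorch recipe for reduced-precision inference; whether Dia's loader already applies a dtype internally is not specified here, so the autocast wrapper and the `generate` call are assumptions carried over from the example above:

```python
import torch

from dia.model import Dia  # assumed import path, as in the example above

model = Dia.from_pretrained("nari-labs/Dia-1.6B")  # assumed checkpoint id

# Standard PyTorch pattern: autocast runs CUDA ops in reduced precision.
# Prefer bfloat16 where the GPU supports it, otherwise fall back to float16.
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

# torch.compile(model) may further speed up repeated generation, assuming
# the underlying object is a compilable nn.Module.
with torch.inference_mode(), torch.autocast("cuda", dtype=dtype):
    audio = model.generate("[S1] Reduced-precision test. [S2] Sounds fine to me.")
```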
inference.
Maintenance & Community
Last updated 3 weeks ago; activity is listed as inactive.
Licensing & Compatibility
Limitations & Caveats
The model currently supports English only. Input text should correspond to roughly 5 to 20 seconds of audio; much shorter or longer scripts tend to sound unnatural. Overusing or misusing non-verbal tags can produce artifacts. Voices vary between runs by default, so speaker consistency requires an audio prompt or a fixed seed (see the sketch below). CPU support is planned but not yet available.
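For reproducible voices without an audio prompt, the usual workaround is to fix the global seeds before generating; this sketch uses only standard-library, NumPy, and PyTorch seeding calls (the commented `generate` call is the assumed API from above):

```python
import random

import numpy as np
import torch

def fix_seed(seed: int = 42) -> None:
    """Seed the common RNGs so repeated runs sample the same voice."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

fix_seed(42)
# audio = model.generate(script)  # same seed -> same sampled speaker identity
```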