dia2 by nari-labs

Streaming dialogue TTS for real-time conversational audio

Created 1 week ago


651 stars

Top 51.4% on SourcePulse

View on GitHub

Project Summary

Dia2 is a streaming dialogue Text-to-Speech (TTS) model designed for real-time conversational audio generation. It addresses the need for low-latency TTS that can begin producing audio as input text is received, enabling more natural and interactive dialogue systems. It is aimed at researchers and developers building real-time conversational AI, virtual assistants, and speech-to-speech applications.

How It Works

The core approach is a streaming dialogue TTS architecture that processes text incrementally. This allows audio generation to commence immediately upon receiving the initial words, rather than waiting for the complete utterance. A key feature is its ability to condition output on audio inputs, such as speaker voice samples or previous conversational turns, facilitating more natural and contextually relevant speech generation for dynamic interactions.
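The consumer-side pattern this implies can be sketched in a few lines. The snippet below is illustrative only: stream_tts is a hypothetical stand-in, not the Dia2 API, and the sample rate and chunk size are assumptions for the sketch.

```python
# Illustrative sketch only: `stream_tts` is a hypothetical stand-in, not the
# Dia2 API. It simulates a streaming TTS engine that yields audio chunks as
# each text fragment arrives, instead of waiting for the full utterance.
from typing import Iterable, Iterator
import numpy as np

SAMPLE_RATE = 24_000          # assumed output sample rate for the sketch
CHUNK = SAMPLE_RATE // 10     # ~100 ms of audio per yielded chunk

def stream_tts(fragments: Iterable[str],
               audio_prefix: np.ndarray | None = None) -> Iterator[np.ndarray]:
    """Placeholder generator: a real streaming model would run incremental
    inference here, optionally conditioned on `audio_prefix` (a speaker
    sample or the previous conversational turn)."""
    for _fragment in fragments:
        yield np.zeros(CHUNK, dtype=np.float32)  # silence stands in for audio

# Text arrives incrementally (e.g., token by token from an upstream LLM);
# playback can begin after the first chunk rather than after the whole sentence.
incoming = iter(["Hey, ", "how are ", "you doing ", "today?"])
for chunk in stream_tts(incoming):
    pass  # hand `chunk` to the playback buffer / audio sink
```

The point of the sketch is the control flow: audio is produced per fragment, so end-to-end latency is governed by the first chunk rather than the full utterance length.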

Quick Start & Requirements

  • Primary install/run command: Install uv first, then install dependencies with uv sync. Commands are executed via uv run ....
  • Non-default prerequisites: CUDA 12.8+ drivers are required.
  • Setup time/resource footprint: The first run downloads the model weights and tokenizer. The CLI defaults to bfloat16 precision and auto-selects CUDA if available (see the environment sketch after this list).
  • Relevant pages: Hugging Face Spaces (for demos), Discord (for community support).
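The exact commands live in the project README; the following is only a minimal environment sketch of the defaults described above. It assumes torch and huggingface_hub are available in the uv-managed environment and that the checkpoints listed under Model Variants below are hosted on the Hugging Face Hub.

```python
# Minimal environment sketch, not project code. Assumes `torch` and
# `huggingface_hub` are installed in the uv-managed environment.
import torch
from huggingface_hub import snapshot_download

# The CLI auto-selects CUDA when available and defaults to bfloat16 precision.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32
print(f"device={device}, dtype={dtype}")

# The first run downloads model weights; fetching explicitly makes the
# disk/network footprint visible up front.
local_dir = snapshot_download("nari-labs/Dia2-2B")  # or "nari-labs/Dia2-1B"
print("weights cached at", local_dir)
```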

Highlighted Details

  • Streaming Generation: Produces audio incrementally as text is input, supporting real-time applications.
  • Conditional Generation: Enables stable and natural output by conditioning on audio prefixes (e.g., speaker identity, user audio), crucial for speech-to-speech systems.
  • Model Variants: Offers 1B and 2B parameter checkpoints (nari-labs/Dia2-1B, nari-labs/Dia2-2B).
  • Real-time Optimization: Includes options like --cuda-graph for performance.
  • Speech-to-Speech Engine: Powers related projects like the Sori speech-to-speech engine.

Maintenance & Community

Questions can be directed to the project's Discord server, and issues can be opened on the repository. Compute for training was provided by the TPU Research Cloud program.

Licensing & Compatibility

The project is licensed under Apache 2.0. Third-party assets retain their original licenses. Apache 2.0 is generally permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

Generation is limited to a maximum of 2 minutes per call. Output quality and voice consistency can vary without prefix conditioning or fine-tuning. The project strictly forbids identity misuse, deceptive content generation, and illegal or malicious use. Transcription of prefix audio files using Whisper adds latency to conditional generation.
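The transcription latency noted above is easy to see in isolation. This is a rough sketch using the open-source whisper package with a placeholder file name, not the project's own pipeline.

```python
# Rough sketch of the prefix-transcription step, not the project's pipeline.
# Uses the open-source `whisper` package; "speaker_prefix.wav" is a placeholder.
import time
import whisper

model = whisper.load_model("base")  # smaller models trade accuracy for speed
start = time.perf_counter()
result = model.transcribe("speaker_prefix.wav")
elapsed = time.perf_counter() - start

print(f"prefix transcript: {result['text']!r}")
print(f"added latency from transcription: {elapsed:.2f}s")
```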

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 3
  • Issues (30d): 2
  • Star History: 676 stars in the last 13 days

Explore Similar Projects

Starred by Omar Sanseviero (DevRel at Google DeepMind), Li Jiang (Coauthor of AutoGen; Engineer at Microsoft), and 3 more.

ChatTTS by 2noise

Generative speech model for daily dialogue

Top 0.1% on SourcePulse
38k stars
Created 1 year ago
Updated 3 days ago