dia2 by nari-labs

Streaming dialogue TTS for real-time conversational audio

Created 1 week ago


651 stars

Top 51.4% on SourcePulse

View on GitHub

Project Summary

Dia2 is a streaming dialogue Text-to-Speech (TTS) model designed for real-time conversational audio generation. It addresses the need for low-latency TTS that can begin producing audio as input text is received, enabling more natural and interactive dialogue systems. It is aimed at researchers and developers building real-time conversational AI, virtual assistants, and speech-to-speech applications.

How It Works

The core approach is a streaming dialogue TTS architecture that processes text incrementally. This allows audio generation to commence immediately upon receiving the initial words, rather than waiting for the complete utterance. A key feature is its ability to condition output on audio inputs, such as speaker voice samples or previous conversational turns, facilitating more natural and contextually relevant speech generation for dynamic interactions.
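The consumer-side pattern this implies can be sketched in a few lines. The snippet below is illustrative only: stream_tts is a hypothetical stand-in, not the Dia2 API, and the sample rate and chunk size are assumptions for the sketch.

```python
# Illustrative sketch only: `stream_tts` is a hypothetical stand-in, not the
# Dia2 API. It simulates a streaming TTS engine that yields audio chunks as
# each text fragment arrives, instead of waiting for the full utterance.
from typing import Iterable, Iterator
import numpy as np

SAMPLE_RATE = 24_000          # assumed output sample rate for the sketch
CHUNK = SAMPLE_RATE // 10     # ~100 ms of audio per yielded chunk

def stream_tts(fragments: Iterable[str],
               audio_prefix: np.ndarray | None = None) -> Iterator[np.ndarray]:
    """Placeholder generator: a real streaming model would run incremental
    inference here, optionally conditioned on `audio_prefix` (a speaker
    sample or the previous conversational turn)."""
    for _fragment in fragments:
        yield np.zeros(CHUNK, dtype=np.float32)  # silence stands in for audio

# Text arrives incrementally (e.g., token by token from an upstream LLM);
# playback can begin after the first chunk rather than after the whole sentence.
incoming = iter(["Hey, ", "how are ", "you doing ", "today?"])
for chunk in stream_tts(incoming):
    pass  # hand `chunk` to the playback buffer / audio sink
```

The point of the sketch is the control flow: audio is produced per fragment, so end-to-end latency is governed by the first chunk rather than the full utterance length.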

Quick Start & Requirements

  • Primary install/run command: Install uv first, then install dependencies with uv sync. Commands are executed via uv run ....
  • Non-default prerequisites: CUDA 12.8+ drivers are required.
  • Setup time/resource footprint: The first run downloads the model weights and tokenizer. The CLI defaults to bfloat16 precision and auto-selects CUDA if available (see the environment sketch after this list).
  • Relevant pages: Hugging Face Spaces (for demos), Discord (for community support).
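The exact commands live in the project README; the following is only a minimal environment sketch of the defaults described above. It assumes torch and huggingface_hub are available in the uv-managed environment and that the checkpoints listed under Model Variants below are hosted on the Hugging Face Hub.

```python
# Minimal environment sketch, not project code. Assumes `torch` and
# `huggingface_hub` are installed in the uv-managed environment.
import torch
from huggingface_hub import snapshot_download

# The CLI auto-selects CUDA when available and defaults to bfloat16 precision.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32
print(f"device={device}, dtype={dtype}")

# The first run downloads model weights; fetching explicitly makes the
# disk/network footprint visible up front.
local_dir = snapshot_download("nari-labs/Dia2-2B")  # or "nari-labs/Dia2-1B"
print("weights cached at", local_dir)
```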

Highlighted Details

  • Streaming Generation: Produces audio incrementally as text is input, supporting real-time applications.
  • Conditional Generation: Enables stable and natural output by conditioning on audio prefixes (e.g., speaker identity, user audio), crucial for speech-to-speech systems.
  • Model Variants: Offers 1B and 2B parameter checkpoints (nari-labs/Dia2-1B, nari-labs/Dia2-2B).
  • Real-time Optimization: Includes options like --cuda-graph for performance.
  • Speech-to-Speech Engine: Powers related projects like the Sori speech-to-speech engine.

Maintenance & Community

Questions can be directed to the project's Discord server, and issues can be opened on the repository. Compute for training was provided by the TPU Research Cloud program.

Licensing & Compatibility

The project is licensed under Apache 2.0. Third-party assets retain their original licenses. Apache 2.0 is generally permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

Generation is limited to a maximum of 2 minutes per call. Output quality and voice consistency can vary without prefix conditioning or fine-tuning. The project strictly forbids identity misuse, deceptive content generation, and illegal or malicious use. Transcription of prefix audio files using Whisper adds latency to conditional generation.
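The transcription latency noted above is easy to see in isolation. This is a rough sketch using the open-source whisper package with a placeholder file name, not the project's own pipeline.

```python
# Rough sketch of the prefix-transcription step, not the project's pipeline.
# Uses the open-source `whisper` package; "speaker_prefix.wav" is a placeholder.
import time
import whisper

model = whisper.load_model("base")  # smaller models trade accuracy for speed
start = time.perf_counter()
result = model.transcribe("speaker_prefix.wav")
elapsed = time.perf_counter() - start

print(f"prefix transcript: {result['text']!r}")
print(f"added latency from transcription: {elapsed:.2f}s")
```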

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 3
  • Issues (30d): 2
  • Star History: 676 stars in the last 13 days

Explore Similar Projects

Starred by Omar Sanseviero (DevRel at Google DeepMind), Li Jiang (Coauthor of AutoGen; Engineer at Microsoft), and 3 more.

ChatTTS by 2noise

Generative speech model for daily dialogue

Top 0.1% on SourcePulse
38k stars
Created 1 year ago
Updated 3 days ago