FireRedTTS2 by FireRedTeam

Streaming TTS for natural, long-form dialogue

Created 2 weeks ago

508 stars

Top 61.4% on SourcePulse

Project Summary

FireRedTTS-2 provides a long-form streaming Text-to-Speech (TTS) system for multi-speaker dialogue generation, delivering stable, natural speech with context-aware prosody. It targets researchers and developers in conversational AI, podcasting, and chatbot development, offering high-quality, low-latency synthesis with advanced zero-shot voice cloning and multilingual capabilities.

How It Works

The system uses a dual-transformer architecture with a 12.5Hz streaming speech tokenizer. This enables flexible, sentence-by-sentence generation and a first-packet latency as low as 140ms on an L20 GPU. It currently supports conversational speech of up to 3 minutes with 4 speakers, a limit the project describes as scalable. It covers 7 languages and supports zero-shot voice cloning, including cross-lingual and code-switching scenarios. Random timbre generation is also supported.
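To put the quoted figures in perspective, the arithmetic below works out what a 12.5Hz frame rate implies. It is a minimal sketch that uses only the numbers from this summary; the helper names are illustrative and are not part of the FireRedTTS-2 codebase, and the summary does not say how many discrete tokens each frame carries, so the counts are frames rather than tokens.

```python
# Back-of-the-envelope numbers derived from the figures quoted in the summary.
TOKENIZER_RATE_HZ = 12.5    # frame rate of the streaming speech tokenizer
DIALOGUE_SECONDS = 3 * 60   # quoted long-form dialogue length (3 minutes)
FIRST_PACKET_MS = 140       # quoted best-case first-packet latency (L20 GPU)


def frame_duration_ms(rate_hz: float) -> float:
    """Audio duration covered by one tokenizer frame, in milliseconds."""
    return 1000.0 / rate_hz


def frames_for(seconds: float, rate_hz: float) -> int:
    """Number of tokenizer frames needed to cover `seconds` of audio."""
    return round(seconds * rate_hz)


if __name__ == "__main__":
    # Each frame spans 80 ms of audio.
    print(frame_duration_ms(TOKENIZER_RATE_HZ))
    # A 3-minute dialogue corresponds to 2250 frames.
    print(frames_for(DIALOGUE_SECONDS, TOKENIZER_RATE_HZ))
    # The quoted first-packet latency equals 1.75 frame durations.
    print(FIRST_PACKET_MS / frame_duration_ms(TOKENIZER_RATE_HZ))
```

A low frame rate keeps the frame sequence short, which is one plausible reason a 3-minute dialogue remains tractable; the summary itself does not state this rationale.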

Quick Start & Requirements

Installation requires cloning the repo, setting up a Python 3.11 Conda environment, and installing PyTorch with CUDA 12.6 support. Dependencies are managed via requirements.txt. Pre-trained models are available via Git LFS from Hugging Face. A Gradio web UI demo is provided for easy generation (python gradio_demo.py).
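Before running the demo, it can help to confirm that the environment matches the stated requirements. The check below is a hedged sketch, not a script shipped with the project: the function names are hypothetical, and it only verifies the Python 3.11 and CUDA 12.6 figures mentioned above using `sys.version_info` and PyTorch's `torch.version.cuda` attribute.

```python
import sys

REQUIRED_PYTHON = (3, 11)  # Python version named in the summary
REQUIRED_CUDA = "12.6"     # CUDA version named in the summary


def python_ok(version_info=sys.version_info) -> bool:
    """True if the interpreter's major.minor version is 3.11."""
    return tuple(version_info[:2]) == REQUIRED_PYTHON


def torch_cuda_version():
    """CUDA version PyTorch was built against, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    # torch.version.cuda is None for CPU-only builds.
    return torch.version.cuda


def check_environment() -> list:
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    if not python_ok():
        found = ".".join(str(part) for part in sys.version_info[:2])
        problems.append(f"Python 3.11 expected, found {found}")
    cuda = torch_cuda_version()
    if cuda is None:
        problems.append("PyTorch is missing or is a CPU-only build")
    elif not cuda.startswith(REQUIRED_CUDA):
        problems.append(f"CUDA {REQUIRED_CUDA} build expected, found {cuda}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

An empty result only means these two version checks passed; it says nothing about the packages in requirements.txt or the downloaded model weights.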

Highlighted Details

  • Long Conversational Speech: Generates up to 3 minutes of dialogue with 4 speakers, a limit described as scalable.
  • Multilingual & Zero-Shot Cloning: Supports 7 languages with cross-lingual and code-switching voice cloning.
  • Ultra-Low Latency: Achieves first-packet latency as low as 140ms on an L20 GPU.
  • Random Timbre Generation: Facilitates synthetic data creation.
  • High Stability: Demonstrates high speaker similarity and low word and character error rates (WER/CER).
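For readers unfamiliar with the WER/CER metrics in the last bullet, the sketch below shows the standard definition: the edit distance between a reference transcript and a recognized transcript, divided by the reference length. This is a generic illustration and an assumption about how such scores are usually computed; the summary does not describe the project's own evaluation pipeline, text normalization, or recognizer.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring spaces."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(wer("the cat sat on the mat", "the cat sat on mat"))
    # One substituted character out of five.
    print(cer("hello", "hallo"))
```

CER is typically reported for languages written without word delimiters, such as Chinese, and WER for languages such as English.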

Maintenance & Community

The roadmap includes releasing an enhanced multilingual model, fine-tuning code, and an end-to-end text-to-blog pipeline in October 2025. No specific community channels or contributor details are listed.

Licensing & Compatibility

No explicit license is stated. A disclaimer restricts zero-shot voice cloning to academic research and prohibits illegal use, which suggests the project is intended for non-commercial, research-focused use.

Limitations & Caveats

Zero-shot voice cloning is limited to academic research and must not be used illegally. Installation requires a PyTorch build that matches CUDA 12.6. The project acknowledges adapting code from other models, so usage terms from those sources may also apply.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 3
  • Issues (30d): 15
  • Star History: 528 stars in the last 16 days
