FireRedTTS2 by FireRedTeam

Streaming TTS for natural, long-form dialogue

Created 5 months ago

1,342 stars

Top 29.6% on SourcePulse

Project Summary

FireRedTTS-2 provides a long-form streaming Text-to-Speech (TTS) system for multi-speaker dialogue generation, delivering stable, natural speech with context-aware prosody. It targets researchers and developers in conversational AI, podcasting, and chatbot development, offering high-quality, low-latency synthesis with advanced zero-shot voice cloning and multilingual capabilities.

How It Works

The system uses a dual-transformer architecture with a 12.5 Hz streaming speech tokenizer, which enables sentence-by-sentence generation and a first-packet latency as low as 140 ms on an L20 GPU. It currently generates conversational speech of up to 3 minutes with 4 speakers, a limit the project describes as scalable. It covers 7 languages, and its zero-shot voice cloning works in cross-lingual and code-switching scenarios. Random timbre generation is also supported.
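To put the tokenizer rate in perspective, the arithmetic below converts the stated 12.5 Hz frame rate into a frame duration and a frame count for a 3-minute dialogue. This is a back-of-the-envelope sketch, not project code: the helper names are made up for illustration, and it counts tokenizer frames only, since the number of codebook tokens per frame is not given here.

```python
import math

FRAME_RATE_HZ = 12.5  # tokenizer frame rate stated in the summary


def frame_duration_ms(frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Audio duration covered by one tokenizer frame, in milliseconds."""
    return 1000.0 / frame_rate_hz


def frames_for(seconds: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of frames needed to cover `seconds` of audio, rounded up."""
    return math.ceil(seconds * frame_rate_hz)


if __name__ == "__main__":
    print(frame_duration_ms())  # 80.0 ms of audio per frame
    print(frames_for(180))      # 2250 frames for a 3-minute dialogue
```

At 80 ms per frame, the quoted 140 ms first-packet latency is shorter than the audio spanned by two frames.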

Quick Start & Requirements

Installation involves cloning the repository, creating a Python 3.11 Conda environment, and installing a PyTorch build with CUDA 12.6 support. The remaining dependencies are installed from requirements.txt. Pre-trained models are downloaded from Hugging Face via Git LFS. A Gradio web UI demo (python gradio_demo.py) is provided for interactive generation.
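Before launching the demo, it can help to confirm that the environment matches the stated requirements. The script below is a minimal sketch and not part of the repository: the function names are hypothetical, and it relies only on sys.version_info plus PyTorch's torch.version.cuda and torch.cuda.is_available().

```python
import sys

REQUIRED_PYTHON = (3, 11)     # Python version named in the requirements
REQUIRED_CUDA = ["12", "6"]   # CUDA version the PyTorch build should target


def check_python(version_info=sys.version_info) -> bool:
    """True if the interpreter is Python 3.11.x."""
    return tuple(version_info[:2]) == REQUIRED_PYTHON


def check_cuda(cuda_version) -> bool:
    """True if a torch.version.cuda string (None on CPU-only builds) is 12.6."""
    if cuda_version is None:
        return False
    return cuda_version.split(".")[:2] == REQUIRED_CUDA


def main() -> None:
    print("Python 3.11:", check_python())
    try:
        import torch  # imported lazily so the check runs without PyTorch
    except ImportError:
        print("PyTorch is not installed")
        return
    print("PyTorch CUDA build:", torch.version.cuda)
    print("Matches CUDA 12.6:", check_cuda(torch.version.cuda))
    print("GPU visible:", torch.cuda.is_available())


if __name__ == "__main__":
    main()
```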

Highlighted Details

Long Conversational Speech: Generates up to 3 minutes of dialogue with 4 speakers; the project describes this limit as scalable.
Multilingual & Zero-Shot Cloning: Supports 7 languages with cross-lingual and code-switching voice cloning.
Ultra-Low Latency: Achieves first-packet latency as low as 140ms on an L20 GPU.
Random Timbre Generation: Facilitates synthetic data creation.
High Stability: Demonstrates high speaker similarity and low word and character error rates (WER/CER).
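For readers unfamiliar with the stability metrics above, WER and CER are conventionally computed as the edit distance between a reference transcript and a transcript of the synthesized audio, divided by the reference length. The sketch below implements that standard definition. It is not the project's evaluation code; the ASR model and text normalization used for the reported numbers are not described here.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring spaces."""
    ref_chars = reference.replace(" ", "")
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hypothesis.replace(" ", "")) / len(ref_chars)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(wer("the cat sat on the mat", "the cat sat on mat"))
```

CER is typically reported instead of WER for languages such as Chinese, where word boundaries are not marked by spaces.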

Maintenance & Community

The roadmap includes releasing an enhanced multilingual model, fine-tuning code, and an end-to-end text-to-blog pipeline in October 2025. No specific community channels or contributor details are listed.

Licensing & Compatibility

No explicit license is stated. A disclaimer restricts zero-shot voice cloning to academic research and prohibits illegal use, which suggests the project is intended for research rather than commercial deployment.

Limitations & Caveats

Zero-shot voice cloning is restricted to academic research and must not be used for illegal purposes. Installation requires a PyTorch build that matches CUDA 12.6. The project acknowledges adapting code from other models, so the usage terms of those upstream sources may also apply.
