Real-time spoken dialogue system with GPT-4o-level capabilities
SpeechGPT 2.0-preview is an end-to-end, real-time spoken dialogue system designed for human-like interaction. It targets researchers and developers building advanced conversational AI, offering capabilities like low-latency responses, emotional expression, role-playing, and tool usage, all within a unified speech-and-text model.
How It Works
The system uses a speech-text mixed modeling approach built on a custom low-bitrate streaming speech codec (750 bps) that jointly models semantics and acoustics. Its core technique, "Codec Patchify," aggregates codec tokens into patches, which reduces the discrepancy between the speech and text modalities so that a single LLM can process both. The LLM's hidden states serve two purposes: they drive text generation, and they feed a multi-decoder patch decoder that reconstructs speech. This lets the model combine speech and text capabilities without sacrificing language intelligence.
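The patch aggregation step can be illustrated with a minimal sketch. This is not the project's implementation: the function names, the padding marker, and the patch size of 3 are all hypothetical, chosen only to show how grouping adjacent codec tokens shortens the sequence the LLM has to process. In a real model, each patch would presumably then be mapped to a single input vector by a learned layer, which is omitted here.

```python
from typing import List

PAD_ID = -1  # hypothetical padding marker, not taken from the project


def patchify(tokens: List[int], patch_size: int, pad_id: int = PAD_ID) -> List[List[int]]:
    """Group adjacent codec tokens into fixed-size patches, padding the last one."""
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    patches = []
    for start in range(0, len(tokens), patch_size):
        patch = tokens[start:start + patch_size]  # slicing returns a new list
        patch += [pad_id] * (patch_size - len(patch))  # pad a short final patch
        patches.append(patch)
    return patches


def unpatchify(patches: List[List[int]], pad_id: int = PAD_ID) -> List[int]:
    """Flatten patches back into a token sequence, dropping padding."""
    return [tok for patch in patches for tok in patch if tok != pad_id]


codec_tokens = [12, 7, 99, 3, 41, 8, 15]  # made-up token ids
patches = patchify(codec_tokens, patch_size=3)
print(patches)  # [[12, 7, 99], [3, 41, 8], [15, -1, -1]]
print(len(codec_tokens), "tokens ->", len(patches), "patches")
```

With a patch size of 3, seven codec tokens become three sequence positions, which is the sense in which patching brings the speech token rate closer to that of text.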
Quick Start & Requirements
Install dependencies: pip3 install -r requirements.txt flash-attn==2.7.3 --no-build-isolation
Run the demo: python3 demo_gradio.py --codec_ckpt_path <path_to_codec> --model_path <path_to_model>
Prerequisites: git-lfs, PyTorch.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Currently, the model is trained exclusively on Chinese speech data and lacks English dialogue capabilities.