SpeechGPT-2.0-preview by OpenMOSS

Real-time spoken dialogue system with GPT-4o-level capabilities

created 6 months ago
347 stars

Top 81.1% on sourcepulse

Project Summary

SpeechGPT 2.0-preview is an end-to-end, real-time spoken dialogue system designed for human-like interaction. It targets researchers and developers building advanced conversational AI, offering capabilities like low-latency responses, emotional expression, role-playing, and tool usage, all within a unified speech-and-text model.

How It Works

The system uses a speech-text mixed modeling approach built on a custom low-bitrate streaming speech codec (750 bps) that jointly models semantics and acoustics. Its core technique, "Codec Patchify," aggregates codec tokens into patches, which reduces the discrepancy between the speech and text modalities so a single LLM can process both. The LLM's hidden states serve two purposes: text generation and, via a multi-decoder patch decoder, speech reconstruction. This lets the model integrate speech and text capabilities without sacrificing language intelligence.
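The patching idea can be illustrated with a small sketch. It is not the project's implementation: the summary does not say how tokens are aggregated, so plain concatenation of adjacent frame vectors is an assumption here, as are the patch size, the vector width, and the 75 tokens/s with a 1024-entry codebook used to show one way a 750 bps stream could arise. The names `codec_bitrate_bps` and `patchify` are hypothetical.

```python
import math


def codec_bitrate_bps(tokens_per_second, codebook_size, num_codebooks=1):
    """Bits per second of a discrete codec stream.

    Each token drawn from a codebook of size N carries log2(N) bits.
    """
    return tokens_per_second * num_codebooks * math.log2(codebook_size)


def patchify(frames, patch_size):
    """Concatenate each run of `patch_size` adjacent frame vectors into one patch.

    Assumption: aggregation is plain concatenation; the real model may
    aggregate differently (for example with a learned projection).
    """
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    if len(frames) % patch_size != 0:
        raise ValueError("pad the sequence to a multiple of patch_size first")
    patches = []
    for start in range(0, len(frames), patch_size):
        patch = []
        for frame in frames[start:start + patch_size]:
            patch.extend(frame)
        patches.append(patch)
    return patches


if __name__ == "__main__":
    # Hypothetical configuration: 75 tokens/s from one 1024-entry codebook.
    print(codec_bitrate_bps(75, 1024))

    # Six 2-dimensional frame vectors grouped into patches of three frames.
    frames = [[float(t), float(t) + 0.5] for t in range(6)]
    patches = patchify(frames, patch_size=3)
    print(len(patches), len(patches[0]))
```

Grouping frames this way shortens the sequence the LLM sees by a factor of `patch_size`, which is one plausible reason patching narrows the gap between speech and text sequence lengths.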

Quick Start & Requirements

  • Install: Clone the repository, then install dependencies (pip3 install -r requirements.txt flash-attn==2.7.3 --no-build-isolation).
  • Models: Download the Codec and LLM weights from Hugging Face.
  • Run Demo: python3 demo_gradio.py --codec_ckpt_path <path_to_codec> --model_path <path_to_model>
  • Prerequisites: Python 3, git-lfs, PyTorch.
  • Demo: Demo System, Demo Video
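As a convenience, the demo launch step above can be wrapped in a small helper that checks the weight directories exist before starting. This is only a sketch: the flags `--codec_ckpt_path` and `--model_path` come from the command above, while the helper names and the `checkpoints/...` paths are hypothetical placeholders for wherever the weights were downloaded.

```python
import shlex
from pathlib import Path


def build_demo_command(codec_ckpt_path, model_path, python="python3"):
    """Return the demo command as an argument list (safe to pass to subprocess.run)."""
    return [
        python,
        "demo_gradio.py",
        "--codec_ckpt_path", str(codec_ckpt_path),
        "--model_path", str(model_path),
    ]


def missing_paths(*paths):
    """Return the subset of `paths` that do not exist on disk."""
    return [str(p) for p in paths if not Path(p).exists()]


if __name__ == "__main__":
    # Hypothetical local locations of the downloaded weights.
    codec = "checkpoints/codec"
    model = "checkpoints/llm"

    missing = missing_paths(codec, model)
    if missing:
        print("Download these first:", ", ".join(missing))
    else:
        print("Run:", shlex.join(build_demo_command(codec, model)))
```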

Highlighted Details

  • Achieves <200ms latency for real-time interaction.
  • Supports style generalization, multi-emotion/style/voice control, and role-playing.
  • Integrates tool calling, web search, and external knowledge bases.
  • Trained on millions of hours of Chinese speech data.
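The summary does not say how the sub-200 ms latency figure is measured. One common proxy for a streaming system is time to first output chunk, sketched below. `fake_stream` is a hypothetical stand-in for any iterator that yields audio or text chunks; it is not the project's API.

```python
import time


def time_to_first_chunk(stream):
    """Return (first_chunk, seconds elapsed until that chunk arrived)."""
    start = time.perf_counter()
    first = next(iter(stream))
    return first, time.perf_counter() - start


def fake_stream(delay_s, n_chunks):
    """Hypothetical stand-in for a streaming model: yields a chunk every `delay_s` seconds."""
    for i in range(n_chunks):
        time.sleep(delay_s)
        yield f"chunk-{i}"


if __name__ == "__main__":
    chunk, latency = time_to_first_chunk(fake_stream(0.05, 3))
    print(chunk, f"{latency * 1000:.0f} ms")
```

A full end-to-end measurement would also need to include audio capture, network transport, and playback, none of which this sketch covers.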

Maintenance & Community

  • Based on Qwen2.5-7B-Instruct.
  • Supported by Agora and RTE Developer Community for real-time audio transmission.
  • Community program: RTE Community

Licensing & Compatibility

  • Code licensed under Apache 2.0.
  • Model weights are available via Hugging Face.

Limitations & Caveats

Currently, the model is trained exclusively on Chinese speech data and lacks English dialogue capabilities.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 29 stars in the last 90 days
