SpeechGPT-2.0-preview by OpenMOSS

Real-time spoken dialogue system with GPT-4o-level capabilities

Created 7 months ago
354 stars

Top 78.8% on SourcePulse

Project Summary

SpeechGPT 2.0-preview is an end-to-end, real-time spoken dialogue system designed for human-like interaction. It targets researchers and developers building advanced conversational AI, offering capabilities like low-latency responses, emotional expression, role-playing, and tool usage, all within a unified speech-and-text model.

How It Works

The system uses speech-text mixed modeling built on a custom low-bitrate streaming speech codec (750 bps) that jointly models semantics and acoustics. The core innovation is "Codec Patchify," which aggregates codec tokens into patches so that speech and text look more alike to a single LLM, reducing cross-modal discrepancies. The LLM's hidden states serve two purposes: they drive text generation directly, and a multi-decoder patch decoder turns them back into speech. This integrates speech and text capabilities without sacrificing language intelligence.
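The patch aggregation step can be illustrated with a small sketch. This is not the project's implementation: the function name `patchify`, the patch size, the padding id, and the choice to flatten frames into one list are all assumptions made for illustration. It only shows the general idea of grouping consecutive codec frames so the LLM sees one position per patch instead of one per frame.

```python
def patchify(frames, patch_size, pad_id=0):
    """Group consecutive codec frames into fixed-size patches.

    frames: list of frames, each a list of codec token ids
            (one id per codebook). Hypothetical layout.
    Returns a list of patches, each a flat list of
    patch_size * n_codebooks token ids.
    """
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    if not frames:
        return []

    n_codebooks = len(frames[0])

    # Pad the tail so the frame count is a multiple of patch_size
    # (padding with pad_id is an assumption of this sketch).
    remainder = len(frames) % patch_size
    if remainder:
        padding = [[pad_id] * n_codebooks] * (patch_size - remainder)
        frames = frames + padding

    patches = []
    for start in range(0, len(frames), patch_size):
        patch = []
        for frame in frames[start:start + patch_size]:
            patch.extend(frame)  # flatten the frames of one patch
        patches.append(patch)
    return patches


# Five frames with two codebooks each, grouped two frames per patch.
example_frames = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
example_patches = patchify(example_frames, patch_size=2)
print(len(example_frames), "frames ->", len(example_patches), "patches")
```

In a real model each patch would presumably be mapped to a single embedding before entering the LLM; that projection is omitted here because the source does not describe it.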

Quick Start & Requirements

  • Install: Clone the repository, then install dependencies (pip3 install -r requirements.txt, followed by pip3 install flash-attn==2.7.3 --no-build-isolation).
  • Models: Download Codec and LLM weights from Hugging Face.
  • Run Demo: python3 demo_gradio.py --codec_ckpt_path <path_to_codec> --model_path <path_to_model>
  • Prerequisites: Python 3, git-lfs, PyTorch.
  • Demo: Demo System, Demo Video
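The demo launch step above can be wrapped in a small helper. This is a sketch under stated assumptions: the helper names `build_demo_command` and `missing_paths` are hypothetical and not part of the repository; only the script name `demo_gradio.py` and the two flags come from the instructions above.

```python
import shlex
from pathlib import Path


def build_demo_command(codec_ckpt_path, model_path, python_bin="python3"):
    """Return the demo launch command as an argument list.

    Only the script name and the two flags come from the quick-start
    text; this helper itself is hypothetical.
    """
    return [
        python_bin,
        "demo_gradio.py",
        "--codec_ckpt_path", str(codec_ckpt_path),
        "--model_path", str(model_path),
    ]


def missing_paths(*paths):
    """Return the given paths that do not exist on disk."""
    return [str(p) for p in paths if not Path(p).exists()]


# Placeholder paths; substitute the locations of the downloaded weights.
cmd = build_demo_command("path/to/codec", "path/to/model")
print(shlex.join(cmd))
```

To actually start the demo, check that `missing_paths(...)` returns an empty list and then pass the command to `subprocess.run(cmd, check=True)` from the repository root.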

Highlighted Details

  • Achieves <200 ms latency for real-time interaction.
  • Supports style generalization, multi-emotion/style/voice control, and role-playing.
  • Integrates tool calling, web search, and external knowledge bases.
  • Trained on millions of hours of Chinese speech data.

Maintenance & Community

  • Based on Qwen2.5-7B-Instruct.
  • Supported by Agora and RTE Developer Community for real-time audio transmission.
  • Community program: RTE Community

Licensing & Compatibility

  • Code licensed under Apache 2.0.
  • Model weights are available via Hugging Face.

Limitations & Caveats

Currently, the model is trained exclusively on Chinese speech data and lacks English dialogue capabilities.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 6 stars in the last 30 days
