SpeechGPT-2.0-preview by OpenMOSS

Real-time spoken dialogue system with GPT-4o-level capabilities

Created 1 year ago

368 stars

Top 77.0% on SourcePulse

Project Summary

SpeechGPT 2.0-preview is an end-to-end, real-time spoken dialogue system designed for human-like interaction. It targets researchers and developers building advanced conversational AI, offering capabilities like low-latency responses, emotional expression, role-playing, and tool usage, all within a unified speech-and-text model.

How It Works

The system uses speech-text mixed modeling built on a custom low-bitrate streaming speech codec (750 bps) that jointly models semantics and acoustics. Its core technique, "Codec Patchify," aggregates codec tokens into patches, which reduces cross-modal discrepancies so that a single LLM can process both speech and text. The LLM's hidden states serve two purposes: text generation and speech reconstruction via a multi-decoder patch decoder. This integrates speech and text capabilities without sacrificing language intelligence.
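To make the patch aggregation step concrete, here is a minimal sketch in plain Python. It is not the project's implementation: the patch size of 3, the padding id of 0, and the function names patchify and unpatchify are assumptions chosen for illustration. The source states only that codec tokens are aggregated into patches.

```python
from typing import List, Sequence


def patchify(tokens: Sequence[int], patch_size: int, pad_id: int = 0) -> List[List[int]]:
    """Group a flat stream of codec token ids into fixed-size patches.

    patch_size and pad_id are illustrative assumptions, not values from the project.
    """
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    patches = []
    for start in range(0, len(tokens), patch_size):
        patch = list(tokens[start:start + patch_size])
        # Pad the final patch so every patch has the same length.
        patch += [pad_id] * (patch_size - len(patch))
        patches.append(patch)
    return patches


def unpatchify(patches: Sequence[Sequence[int]], length: int) -> List[int]:
    """Flatten patches back into a token stream and drop the padding."""
    flat = [token for patch in patches for token in patch]
    return flat[:length]


codec_tokens = [11, 42, 7, 99, 3, 58, 21]  # made-up token ids
patches = patchify(codec_tokens, patch_size=3)
print(patches)       # [[11, 42, 7], [99, 3, 58], [21, 0, 0]]
print(len(patches))  # 3
```

In this sketch, seven codec tokens become three patches, so the speech stream reaches the language model at a coarser granularity than raw codec frames. How the real model embeds each patch and decodes it back to audio is not covered here.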

Quick Start & Requirements

Install: Clone the repository, then install the dependencies with pip3 install -r requirements.txt flash-attn==2.7.3 --no-build-isolation
Models: Download the codec and LLM weights from Hugging Face.
Run Demo: python3 demo_gradio.py --codec_ckpt_path <path_to_codec> --model_path <path_to_model>
Prerequisites: Python 3, git-lfs, PyTorch.
Demo: Demo System, Demo Video
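The run step above can be wrapped in a small launcher that checks the weight paths before starting the demo. This is a sketch only: the script name and the two flags come from the command above, while the helper names build_demo_command and missing_paths and the placeholder paths weights/codec.pt and weights/llm are hypothetical.

```python
import shlex
from pathlib import Path
from typing import List


def build_demo_command(codec_ckpt_path: str, model_path: str, python: str = "python3") -> List[str]:
    """Build the argv list for the Gradio demo from the two weight paths."""
    return [
        python,
        "demo_gradio.py",
        "--codec_ckpt_path", codec_ckpt_path,
        "--model_path", model_path,
    ]


def missing_paths(paths: List[str]) -> List[str]:
    """Return the paths that do not exist on disk, so errors surface early."""
    return [p for p in paths if not Path(p).exists()]


# Placeholder paths (hypothetical); substitute the locations of the downloaded weights.
cmd = build_demo_command("weights/codec.pt", "weights/llm")
print(shlex.join(cmd))
# python3 demo_gradio.py --codec_ckpt_path weights/codec.pt --model_path weights/llm
```

To launch the demo, pass cmd to subprocess.run(cmd, check=True) from the repository root once missing_paths reports an empty list.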

Highlighted Details

Achieves latency under 200 ms for real-time interaction.
Supports style generalization, multi-emotion/style/voice control, and role-playing.
Integrates tool calling, web search, and external knowledge bases.
Trained on millions of hours of Chinese speech data.

Maintenance & Community

Based on Qwen2.5-7B-Instruct.
Supported by Agora and RTE Developer Community for real-time audio transmission.
Community program: RTE Community

Licensing & Compatibility

Code licensed under Apache 2.0.
Model weights are available via Hugging Face.

Limitations & Caveats

Currently, the model is trained exclusively on Chinese speech data and lacks English dialogue capabilities.

Health Check

Last Commit: 1 year ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 1

Star History

0 stars in the last 30 days

Explore Similar Projects

AIVoiceChat by KoljaB

Voice chat for low-latency AI companion interaction

Created 2 years ago

Updated 8 months ago

VITA-Audio by VITA-MLLM

Speech model for fast audio-text token generation

Created 10 months ago

Updated 9 months ago

Starred by Tobi Lutke (Cofounder of Shopify), Luis Capelo (Cofounder of Lightning AI), and 1 more.

vui by fluxions-ai

Conversational speech models for on-device use

Created 8 months ago

Updated 2 weeks ago

FireRedTTS by FireRedTeam

LLM-empowered TTS system for research

Created 1 year ago

Updated 5 months ago

sesame_csm_openai by phildougherty

OpenAI-compatible TTS API for voice cloning

Created 11 months ago

Updated 5 months ago

fast-voice-assistant by dsa

AI voice assistant demo with <500ms response

Created 1 year ago

Updated 1 year ago

ComfyUI-Qwen-TTS by flybirdxx

Advanced ComfyUI nodes for speech synthesis and voice AI

Created 1 month ago

Updated 2 weeks ago

Starred by Omar Sanseviero (DevRel at Google DeepMind), Jonathan Ragan-Kelley (Professor at MIT), and 3 more.

WhisperSpeech by WhisperSpeech

Open-source text-to-speech system built by inverting Whisper

Created 3 years ago

Updated 2 months ago

Starred by Luis Capelo (Cofounder of Lightning AI).

Step-Audio by stepfun-ai

Speech interaction framework for multilingual conversation and controllable speech synthesis

Created 1 year ago

Updated 1 week ago

Starred by Jiaming Song (Chief Scientist at Luma AI), Alex Chen (Cofounder of Nexa AI), and 1 more.

higgs-audio by boson-ai

Expressive text-to-audio generation model

Created 7 months ago

Updated 1 month ago

Starred by Luis Capelo (Cofounder of Lightning AI) and Didier Lopes (Founder of OpenBB).

Zonos by Zyphra

Open-weight text-to-speech model for expressive, high-quality speech generation

Created 1 year ago

Updated 11 months ago

CosyVoice by FunAudioLLM

Voice generation model for inference, training, and deployment

Created 1 year ago

Updated 2 weeks ago
