CleanS2S by opendilab

S2S agent prototype for high-quality, streaming speech interaction

created 10 months ago

461 stars

Top 66.7% on sourcepulse

Project Summary

CleanS2S is a single-file, streaming, full-duplex Speech-to-Speech (S2S) interactive agent prototype designed for researchers and users to experience Linguistic User Interfaces (LUIs). It aims to provide a GPT-4o-like conversational experience, enabling rapid validation of S2S pipeline ideas.

How It Works

The agent chains Automatic Speech Recognition (ASR), a Large Language Model (LLM), and Text-to-Speech (TTS). A WebSocket-based Receiver, which performs voice activity detection (VAD) on incoming audio, and a WebSocket-based Sender handle input and output. The stages run in separate threads connected by queues, so streaming is asynchronous, non-blocking, and real-time. Full-duplex interaction and interruption handling are supported, with strategies to make conversations more engaging than those of typical turn-based chatbots. Web search and Retrieval-Augmented Generation (RAG) are integrated for accessing external information.
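The thread-and-queue design described above can be illustrated with a minimal sketch. It is not CleanS2S's code: `fake_asr`, `fake_llm`, and `fake_tts` are hypothetical stand-ins for the real models, and the shared `threading.Event` only approximates how an interruption (user barge-in detected by the Receiver) could cut off a reply that is still being generated.

```python
import queue
import threading

# Hypothetical stand-ins for the real ASR, LLM and TTS models (assumption:
# CleanS2S's actual classes and signatures are not reproduced here).
def fake_asr(audio_chunk: bytes) -> str:
    return audio_chunk.decode("utf-8")

def fake_llm(text: str):
    # Yield the reply word by word, mimicking a streaming LLM API.
    for word in f"you said {text}".split():
        yield word

def fake_tts(token: str) -> bytes:
    return token.upper().encode("utf-8")

SENTINEL = None  # marks the end of the stream for each stage

def asr_stage(audio_q, text_q):
    while (chunk := audio_q.get()) is not SENTINEL:
        text_q.put(fake_asr(chunk))
    text_q.put(SENTINEL)

def llm_stage(text_q, token_q, interrupt):
    while (text := text_q.get()) is not SENTINEL:
        interrupt.clear()  # a new user turn starts a fresh reply
        for token in fake_llm(text):
            if interrupt.is_set():  # user barged in: drop the rest of this reply
                break
            token_q.put(token)
    token_q.put(SENTINEL)

def tts_stage(token_q, out_q):
    while (token := token_q.get()) is not SENTINEL:
        out_q.put(fake_tts(token))
    out_q.put(SENTINEL)

def run_pipeline(chunks):
    audio_q, text_q, token_q, out_q = (queue.Queue() for _ in range(4))
    interrupt = threading.Event()  # a VAD-driven receiver would set this
    threads = [
        threading.Thread(target=asr_stage, args=(audio_q, text_q)),
        threading.Thread(target=llm_stage, args=(text_q, token_q, interrupt)),
        threading.Thread(target=tts_stage, args=(token_q, out_q)),
    ]
    for t in threads:
        t.start()
    for c in chunks:
        audio_q.put(c)
    audio_q.put(SENTINEL)
    results = []
    while (item := out_q.get()) is not SENTINEL:
        results.append(item)
    for t in threads:
        t.join()
    return results
```

For example, `run_pipeline([b"hello"])` returns `[b"YOU", b"SAID", b"HELLO"]`. Because each stage only blocks on its own input queue, TTS can start on the first token while the LLM is still producing later ones. A real full-duplex system would also flush audio already queued for the Sender when an interruption fires; this sketch omits that step.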

Quick Start & Requirements

Installation: Clone the repository, install backend dependencies (pip install -r requirements.txt), and optionally RAG dependencies (pip install -r backend/requirements-rag.txt). Install funasr (v1.1.6 recommended) and cosyvoice.
Models: Download ASR models (paraformer-zh, ct-punc, fsmn-vad) and TTS model (CosyVoice-300M).
LLM: Uses LLM APIs (e.g., DeepSeek) by default; local LLMs can be configured.
Running Server: python3 -u s2s_server_pipeline.py --recv_host 0.0.0.0 --send_host 0.0.0.0 --stt_model_name <your-asr-path> --enable_llm_api --lm_model_name "deepseek-chat" --lm_model_url "https://api.deepseek.com" --tts_model_name <your-tts-path> --ref_dir <ref-audio-path> --enable_interruption
Frontend: Recommended via Docker; requires Node.js and pnpm for local setup.
Web Search/RAG: Requires Serper API key and an embedding model (e.g., all-MiniLM-L6-v2).
Resources: Requires downloading the ASR/TTS models listed above, plus local LLM weights if not using an LLM API. An API key is needed for the LLM API backend, and a Serper API key for web search.
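When scripting deployments, it can help to assemble the server command shown above programmatically. The sketch below uses only the flags from the documented command; the helper name `build_server_cmd` and the example paths are hypothetical, and it assumes you run it from the directory containing `s2s_server_pipeline.py`.

```python
import shlex

def build_server_cmd(asr_path, tts_path, ref_dir, host="0.0.0.0",
                     lm_name="deepseek-chat", lm_url="https://api.deepseek.com",
                     interruption=True):
    """Assemble the documented launch command as an argv list (hypothetical helper)."""
    cmd = [
        "python3", "-u", "s2s_server_pipeline.py",
        "--recv_host", host,
        "--send_host", host,
        "--stt_model_name", asr_path,  # local ASR model path
        "--enable_llm_api",            # use a remote LLM API instead of a local model
        "--lm_model_name", lm_name,
        "--lm_model_url", lm_url,
        "--tts_model_name", tts_path,  # local TTS model path (e.g. CosyVoice-300M)
        "--ref_dir", ref_dir,          # reference audio directory
    ]
    if interruption:
        cmd.append("--enable_interruption")
    return cmd

if __name__ == "__main__":
    # Example paths are placeholders; print a shell-safe version of the command.
    print(shlex.join(build_server_cmd("models/asr", "models/tts", "ref_audio")))
```

Passing the list to `subprocess.run(cmd, check=True)` launches the server without shell-quoting issues, and `shlex.join` gives a copy-pasteable string for paths containing spaces.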

Highlighted Details

Single-file implementation for easy understanding and modification.
Real-time streaming with full-duplex and interruption capabilities.
Integration of Web Search and RAG for enhanced knowledge access.
Supports customized LLMs and backend parameters.
Frontend client available via Docker or local setup.
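The RAG integration retrieves relevant text and adds it to the LLM prompt. As a rough illustration of that retrieval step only, the sketch below ranks documents by cosine similarity to the query. It is an assumption-laden stand-in, not CleanS2S's RAG code: a toy bag-of-words `embed` replaces a real embedding model such as all-MiniLM-L6-v2, and an in-memory list replaces Serper search results.

```python
import math
from collections import Counter

def embed(text):
    # Hypothetical stand-in for a real embedding model: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a, b):
    # Counter returns 0 for missing keys, so only shared words contribute.
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=2):
    q = embed(query)
    # Rank documents by similarity to the query and keep the top k.
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, docs, k=2):
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, k))
    return f"Answer using the context below.\nContext:\n{context}\nQuestion: {query}"
```

With a real embedding model the ranking logic stays the same; only `embed` changes, and the document vectors would typically be computed once and cached rather than re-embedded per query.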

Maintenance & Community

Active development with a roadmap including inference speed optimization, long-term memory, and more RAG strategies.
Community engagement via GitHub Issues and Discord. WeChat group available via invitation.

Licensing & Compatibility

Released under the Apache 2.0 license.
Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The project is a prototype, and the README notes limitations on token output due to computing resource constraints. Inference speed optimization is listed as a future roadmap item.

Explore Similar Projects

smartcat by efugier

model.nvim by gsuuon

aiaio by abhishekkrthakur

ChatPilot by shibing624

GPTPortal by Zaki-1052

open-assistant-api by MLT-OSS

mods by charmbracelet

RisuAI by kwaroran

langchat by TyCoding

langchainrb by patterns-ai-core

aichat by sigoden

Langchain-Chatchat by chatchat-space