CleanS2S by opendilab

S2S agent prototype for high-quality, streaming speech interaction

created 10 months ago
461 stars

Top 66.7% on sourcepulse

Project Summary

CleanS2S is a single-file, streaming, full-duplex Speech-to-Speech (S2S) interactive agent prototype designed for researchers and users to experience Linguistic User Interfaces (LUIs). It aims to provide a GPT-4o-like conversational experience, enabling rapid validation of S2S pipeline ideas.

How It Works

The agent comprises Automatic Speech Recognition (ASR), a Large Language Model (LLM), and Text-to-Speech (TTS), orchestrated with WebSocket-based Receiver (VAD) and Sender components. It leverages multi-threading and queues for asynchronous, non-blocking, real-time streaming. Full-duplex interaction and interruption handling are supported, with strategies to enhance conversational engagement beyond typical turn-based chatbots. Web search and Retrieval-Augmented Generation (RAG) are integrated for accessing external information.
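The multi-threaded, queue-based orchestration described above can be sketched with standard-library threads and queues. This is an illustrative stand-in, not CleanS2S's actual API: the `fake_asr`/`fake_llm`/`fake_tts` functions and `run_pipeline` are assumptions that only model the data flow between stages.

```python
import queue
import threading

# Illustrative stand-ins for the real ASR/LLM/TTS stages (assumptions,
# not the actual CleanS2S components).
def fake_asr(audio_chunk):
    return f"text({audio_chunk})"

def fake_llm(text):
    return f"reply({text})"

def fake_tts(text):
    return f"audio({text})"

SENTINEL = None  # signals end-of-stream to downstream stages

def stage(fn, inbox, outbox):
    """Consume items from inbox, transform them, pass them downstream.
    Each stage runs in its own thread, so no stage blocks the others."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)
            break
        outbox.put(fn(item))

def run_pipeline(audio_chunks):
    asr_in, llm_in, tts_in, audio_out = (queue.Queue() for _ in range(4))
    stages = [
        threading.Thread(target=stage, args=(fake_asr, asr_in, llm_in)),
        threading.Thread(target=stage, args=(fake_llm, llm_in, tts_in)),
        threading.Thread(target=stage, args=(fake_tts, tts_in, audio_out)),
    ]
    for t in stages:
        t.start()
    for chunk in audio_chunks:   # the Receiver would feed this from a WebSocket
        asr_in.put(chunk)
    asr_in.put(SENTINEL)
    results = []
    while (item := audio_out.get()) is not SENTINEL:
        results.append(item)     # the Sender would stream this back
    for t in stages:
        t.join()
    return results

print(run_pipeline(["chunk1", "chunk2"]))
```

Because each stage pulls from its own queue in its own thread, a long LLM step never blocks the ASR from transcribing the next audio chunk, which is the property that makes streaming feel real-time.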

Quick Start & Requirements

  • Installation: Clone the repository, install backend dependencies (pip install -r requirements.txt), and optionally RAG dependencies (pip install -r backend/requirements-rag.txt). Install funasr (v1.1.6 recommended) and cosyvoice.
  • Models: Download ASR models (paraformer-zh, ct-punc, fsmn-vad) and TTS model (CosyVoice-300M).
  • LLM: Uses LLM APIs (e.g., DeepSeek) by default; local LLMs can be configured.
  • Running the server: python3 -u s2s_server_pipeline.py --recv_host 0.0.0.0 --send_host 0.0.0.0 --stt_model_name <your-asr-path> --enable_llm_api --lm_model_name "deepseek-chat" --lm_model_url "https://api.deepseek.com" --tts_model_name <your-tts-path> --ref_dir <ref-audio-path> --enable_interruption
  • Frontend: Recommended via Docker; requires Node.js and pnpm for local setup.
  • Web Search/RAG: Requires Serper API key and an embedding model (e.g., all-MiniLM-L6-v2).
  • Resources: Requires downloading specific ASR/TTS models and potentially LLM models. API keys for LLM and Serper are needed for enhanced functionality.
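The installation and launch steps above, gathered into one sequence (a sketch, not an official script: the repository URL assumes the standard GitHub location, and the angle-bracket paths are placeholders you must fill in):

```shell
git clone https://github.com/opendilab/CleanS2S.git
cd CleanS2S
pip install -r requirements.txt                 # backend dependencies
pip install -r backend/requirements-rag.txt     # optional, only for RAG
pip install funasr==1.1.6                       # v1.1.6 recommended
# CosyVoice is installed separately; follow its own repository's instructions.

python3 -u s2s_server_pipeline.py \
    --recv_host 0.0.0.0 --send_host 0.0.0.0 \
    --stt_model_name <your-asr-path> \
    --enable_llm_api \
    --lm_model_name "deepseek-chat" \
    --lm_model_url "https://api.deepseek.com" \
    --tts_model_name <your-tts-path> \
    --ref_dir <ref-audio-path> \
    --enable_interruption
```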

Highlighted Details

  • Single-file implementation for easy understanding and modification.
  • Real-time streaming with full-duplex and interruption capabilities.
  • Integration of Web Search and RAG for enhanced knowledge access.
  • Supports customized LLMs and backend parameters.
  • Frontend client available via Docker or local setup.
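The interruption capability listed above can be modeled as a flag plus a queue flush: while the agent plays back queued TTS chunks, a detected user utterance (e.g. from VAD) aborts playback and discards the rest of the reply. This is a toy illustration of the idea, not the project's actual mechanism.

```python
import queue
import threading

class InterruptibleSpeaker:
    """Toy model of full-duplex interruption: playback of queued TTS
    chunks stops, and pending chunks are dropped, once the interruption
    event is set (illustrative class, not part of CleanS2S)."""

    def __init__(self):
        self.playback = queue.Queue()
        self.interrupted = threading.Event()
        self.played = []

    def enqueue_reply(self, chunks):
        for c in chunks:
            self.playback.put(c)

    def interrupt(self):
        # Called when the Receiver's VAD detects the user speaking again.
        self.interrupted.set()

    def play_all(self):
        while not self.playback.empty():
            chunk = self.playback.get()
            if self.interrupted.is_set():
                # Drop the remainder of the pending reply.
                while not self.playback.empty():
                    self.playback.get()
                break
            self.played.append(chunk)
        self.interrupted.clear()
        return self.played

speaker = InterruptibleSpeaker()
speaker.enqueue_reply(["a", "b", "c"])
speaker.interrupt()          # user starts talking before playback begins
print(speaker.play_all())
```

In a full-duplex system the `interrupt()` call would arrive from the Receiver thread while `play_all()` runs in the Sender thread; `threading.Event` is thread-safe, so no extra locking is needed for this handoff.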

Maintenance & Community

  • Active development with a roadmap including inference speed optimization, long-term memory, and more RAG strategies.
  • Community engagement via GitHub Issues and Discord. WeChat group available via invitation.

Licensing & Compatibility

  • Released under the Apache 2.0 license.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The project is a prototype; the README notes that LLM token output is limited due to computing resource constraints, and inference speed optimization remains a roadmap item rather than a delivered feature.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

  • 63 stars in the last 90 days
