S2S agent prototype for high-quality, streaming speech interaction
CleanS2S is a single-file, streaming, full-duplex Speech-to-Speech (S2S) interactive agent prototype designed for researchers and users to experience Linguistic User Interfaces (LUIs). It aims to provide a GPT-4o-like conversational experience, enabling rapid validation of S2S pipeline ideas.
How It Works
The agent comprises Automatic Speech Recognition (ASR), a Large Language Model (LLM), and Text-to-Speech (TTS), orchestrated by WebSocket-based Receiver (with Voice Activity Detection, VAD) and Sender components. It uses multi-threading and queues for asynchronous, non-blocking, real-time streaming. Full-duplex interaction and interruption handling are supported, along with strategies intended to make conversations more engaging than typical turn-based chatbots. Web search and Retrieval-Augmented Generation (RAG) are integrated for accessing external information.
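The thread-and-queue design described above can be sketched as follows. This is a minimal illustration, not the project's actual code: `fake_asr`, `fake_llm`, `fake_tts`, `stage`, and `run_pipeline` are hypothetical names, the models are replaced by string stand-ins, and the WebSocket Receiver and Sender are omitted. Each stage runs in its own thread and passes results downstream through a queue, and a shared `threading.Event` models an interruption by making stages drop pending work.

```python
import queue
import threading

STOP = object()  # sentinel that tells each stage to shut down


def stage(worker, inbox, outbox, interrupted):
    """Consume items from inbox, transform them, and push them to outbox."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # propagate shutdown to the next stage
            return
        if interrupted.is_set():
            continue  # drop stale work once the user has interrupted
        outbox.put(worker(item))


# Hypothetical stand-ins for the real ASR, LLM, and TTS models.
def fake_asr(audio_chunk):
    return f"text({audio_chunk})"


def fake_llm(text):
    return f"reply({text})"


def fake_tts(text):
    return f"audio({text})"


def run_pipeline(chunks, interrupted=None):
    """Feed audio chunks through ASR -> LLM -> TTS and collect the output."""
    interrupted = interrupted or threading.Event()
    q_in, q_text, q_reply, q_out = (queue.Queue() for _ in range(4))
    threads = [
        threading.Thread(target=stage, args=(fake_asr, q_in, q_text, interrupted)),
        threading.Thread(target=stage, args=(fake_llm, q_text, q_reply, interrupted)),
        threading.Thread(target=stage, args=(fake_tts, q_reply, q_out, interrupted)),
    ]
    for t in threads:
        t.start()
    for chunk in chunks:
        q_in.put(chunk)
    q_in.put(STOP)

    results = []
    while True:
        item = q_out.get()
        if item is STOP:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(["chunk0", "chunk1"]))
```

Because every stage blocks only on its own input queue, ASR can start on the next chunk while the LLM and TTS stand-ins are still working on earlier ones. In a real system, a VAD-driven receiver would presumably set the event when new user speech arrives mid-response; that wiring is an assumption and is not shown here.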
Quick Start & Requirements
Install dependencies (pip install -r requirements.txt), and optionally RAG dependencies (pip install -r backend/requirements-rag.txt). Install funasr (v1.1.6 recommended) and cosyvoice.
Run the server:
python3 -u s2s_server_pipeline.py --recv_host 0.0.0.0 --send_host 0.0.0.0 --stt_model_name <your-asr-path> --enable_llm_api --lm_model_name "deepseek-chat" --lm_model_url "https://api.deepseek.com" --tts_model_name <your-tts-path> --ref_dir <ref-audio-path> --enable_interruption
[...] (all-MiniLM-L6-v2).
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project is a prototype, and the README notes limitations on token output due to computing resource constraints. Inference speed optimization is listed as a future roadmap item.