On-Device-Speech-to-Speech-Conversational-AI by asiff00

On-device speech-to-speech conversational AI

Created 1 year ago

255 stars

Top 98.8% on SourcePulse

1 Expert Loves This Project

Dan Guido

Cofounder of Trail of Bits

Project Summary

This project delivers a real-time, on-CPU conversational AI system enabling two-way speech communication. It targets users seeking local, responsive AI interactions without cloud dependencies, offering fluid conversations with immediate responses and natural interruption handling through a continuous streaming architecture.

How It Works

A multi-threaded architecture runs a five-stage pipeline: Voice Activity Detection (Pyannote) feeds Speech Recognition (Whisper), whose transcript goes to a Language Model (qwen2.5 served through Ollama); a custom TextChunker splits the model's output, and Voice Synthesis (Kokoro) speaks each chunk. The components communicate through queues, so each stage runs independently and the system stays responsive. Two latency-reduction techniques stand out: priority-based text chunking, and prompting the LLM to open with filler words so that speech can start almost immediately. Together with the interrupt mechanism, these keep the interaction natural.
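
The queue-based design described above can be illustrated with a minimal sketch. It is not the project's code: the names `stage`, `run_pipeline`, and the `fake_*` functions are hypothetical placeholders, and the real stages would call Pyannote, Whisper, Ollama, and Kokoro instead of formatting strings. The sketch only shows how threads connected by queues let each stage work independently.

```python
import queue
import threading

STOP = object()  # sentinel that tells a stage to shut down


def stage(work, inbox, outbox):
    """Pull items from inbox, apply work, and push results to outbox."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # pass the shutdown signal downstream
            return
        outbox.put(work(item))


def run_pipeline(inputs, stages):
    """Chain the stages with queues and run each one in its own thread."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]), daemon=True)
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()

    for item in inputs:
        queues[0].put(item)
    queues[0].put(STOP)

    # Collect results from the last queue until the sentinel arrives.
    results = []
    while True:
        out = queues[-1].get()
        if out is STOP:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


# Hypothetical stand-ins for speech recognition, the LLM, and synthesis.
def fake_asr(audio):
    return f"text({audio})"


def fake_llm(text):
    return f"reply({text})"


def fake_tts(text):
    return f"audio({text})"


if __name__ == "__main__":
    print(run_pipeline(["utt1", "utt2"], [fake_asr, fake_llm, fake_tts]))
```

Because every stage blocks only on its own inbox, a slow stage (for example synthesis) does not stop an earlier stage from accepting new input, which is the property the summary attributes to the queue-based design.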

Quick Start & Requirements

Installation: Requires Python 3.8+ (tested on 3.12), eSpeak NG (sudo apt install -y espeak-ng on Linux), and Ollama (https://ollama.ai/). Clone the repo, run git lfs pull to fetch the models, configure .env with a HuggingFace token, and install dependencies with pip install -r requirements.txt.
Execution: Start Ollama (ollama run qwen2.5:0.5b-instruct-q8_0), then run python speech_to_speech.py.
Hardware: Tested on an AMD Ryzen 5600G with 16GB RAM and SSD, no GPU required.
Demo: Video available at https://youtu.be/x92FLnwf-nA.
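
Before running the script, it can help to confirm that the prerequisites listed above are in place. The following is a small pre-flight sketch, not part of the project: the function name `check_prerequisites` is hypothetical, and it assumes the tools are installed under their usual executable names (`espeak-ng`, `ollama`, `git-lfs`) and that `.env` lives in the project root.

```python
import shutil
import sys
from pathlib import Path


def check_prerequisites(project_dir="."):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []

    # The summary states Python 3.8+ is required.
    if sys.version_info < (3, 8):
        problems.append("Python 3.8 or newer is required")

    # Assumed executable names for the external tools.
    for tool in ("espeak-ng", "ollama", "git-lfs"):
        if shutil.which(tool) is None:
            problems.append(f"'{tool}' was not found on PATH")

    # The HuggingFace token is configured in .env (assumed to be in the project root).
    if not (Path(project_dir) / ".env").is_file():
        problems.append("no .env file found in the project directory")

    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("missing:", problem)
```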

Highlighted Details

Performance: Achieved ~2s latency on a CPU-only setup (Ryzen 5600G), with an average response time of ~1.5s from the end of user speech.
Latency Reduction: Employs priority-based text chunking (sentence breaks first, then semantic breaks) and prompts the LLM to open with filler words ("umm", "so"), reducing perceived latency by as much as 50-70%.
Interruption Handling: A custom TextChunker and interrupt mechanism allow users to naturally interrupt the AI mid-response, enhancing conversational flow.
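
The chunking and interruption ideas above can be sketched as follows. This is an illustrative approximation, not the project's TextChunker: the class `PriorityChunker`, the function `speak_all`, the regular expressions, and the `max_wait_chars` threshold are all assumptions. The sketch prefers sentence breaks, falls back to clause-level breaks (standing in for "semantic breaks") once the buffer grows long, and stops speaking when an interrupt flag is set.

```python
import re
import threading

SENTENCE_BREAK = re.compile(r"[.!?]\s")  # highest priority
CLAUSE_BREAK = re.compile(r"[,;:]\s")    # fallback when the buffer gets long


class PriorityChunker:
    def __init__(self, max_wait_chars=40):
        self.buffer = ""
        self.max_wait_chars = max_wait_chars  # assumed threshold for the fallback

    def _find_cut(self):
        """Return the index to cut at, or None if no chunk is ready yet."""
        match = SENTENCE_BREAK.search(self.buffer)
        if match:
            return match.end()
        if len(self.buffer) >= self.max_wait_chars:
            match = CLAUSE_BREAK.search(self.buffer)
            if match:
                return match.end()
        return None

    def feed(self, token):
        """Add streamed text and return the chunks that are ready to speak."""
        self.buffer += token
        chunks = []
        while True:
            cut = self._find_cut()
            if cut is None:
                return chunks
            chunks.append(self.buffer[:cut].strip())
            self.buffer = self.buffer[cut:]

    def flush(self):
        """Return whatever text remains once the stream has ended."""
        rest, self.buffer = self.buffer.strip(), ""
        return [rest] if rest else []


def speak_all(tokens, speak, interrupted):
    """Speak chunks as they become ready; return False if interrupted."""
    chunker = PriorityChunker()
    for token in tokens:
        if interrupted.is_set():
            return False  # the user started talking: drop the rest
        for chunk in chunker.feed(token):
            speak(chunk)
    if interrupted.is_set():
        return False
    for chunk in chunker.flush():
        speak(chunk)
    return True


if __name__ == "__main__":
    stop_flag = threading.Event()
    speak_all(["Umm, so ", "that works. ", "Anything else"], print, stop_flag)
```

In a real pipeline the `interrupted` event would presumably be set by the voice-activity stage when it detects user speech, and `speak` would hand the chunk to the synthesizer; both are left abstract here.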

Maintenance & Community

The provided README does not detail specific contributors, community channels (e.g., Discord, Slack), or a public roadmap.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. This lack of clarity presents a significant barrier for commercial use or integration into closed-source projects.

Limitations & Caveats

The system is optimized and tested for CPU-only execution; the README notes that using a GPU would likely yield substantial performance gains. Results are reported "in my test system", which suggests a personal implementation benchmarked on a single machine. The absence of a clear license is a critical adoption blocker.

Health Check

Last Commit

7 months ago

Responsiveness

Inactive

Star History

3 stars in the last 30 days