On-Device-Speech-to-Speech-Conversational-AI  by asiff00

On-device speech-to-speech conversational AI

Created 1 year ago
250 stars

Top 100.0% on SourcePulse


Summary

This project delivers a real-time, on-CPU conversational AI system enabling two-way speech communication. It targets users seeking local, responsive AI interactions without cloud dependencies, offering fluid conversations with immediate responses and natural interruption handling through a continuous streaming architecture.

How It Works

A multi-threaded architecture orchestrates a pipeline: Voice Activity Detection (Pyannote) feeds into Speech Recognition (Whisper), then to a Language Model (Ollama/qwen2.5), processed by a custom TextChunker, and finally synthesized via Voice Synthesis (Kokoro). Components communicate via queues, enabling independent operation and responsiveness. Novel latency reduction techniques include priority-based text chunking and LLM prompting with leading filler words for natural, immediate interaction and interruption handling.
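
The queue-connected, one-thread-per-stage layout described above can be sketched as follows. This is a minimal illustration, not the project's code: `stage`, `run_pipeline`, the `STOP` sentinel, and the three `fake_*` functions are hypothetical names, and the placeholder functions stand in for the real Whisper, Ollama, and Kokoro calls.

```python
import queue
import threading

STOP = object()  # sentinel that tells a stage to shut down


def stage(fn, inbox, outbox):
    """Apply fn to every item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # propagate shutdown to the next stage
            return
        outbox.put(fn(item))


def run_pipeline(fns, items):
    """Chain the callables in fns with queues, running one thread per stage."""
    queues = [queue.Queue() for _ in range(len(fns) + 1)]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]), daemon=True)
        for i, fn in enumerate(fns)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(STOP)

    results = []
    while True:
        out = queues[-1].get()
        if out is STOP:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


# Hypothetical placeholders for speech recognition, the LLM, and synthesis.
def fake_transcribe(audio):
    return f"text({audio})"


def fake_reply(text):
    return f"reply({text})"


def fake_synthesize(text):
    return f"audio({text})"


if __name__ == "__main__":
    print(run_pipeline([fake_transcribe, fake_reply, fake_synthesize], ["hello"]))
```

Because each stage only blocks on its own inbox, a slow stage (for example synthesis) does not stop upstream stages from accepting new input, which is the property the summary attributes to the queue-based design.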

Quick Start & Requirements

  • Installation: Requires Python 3.8+ (tested 3.12), eSpeak NG (sudo apt install -y espeak-ng on Linux), and Ollama (https://ollama.ai/). Clone the repo, run git lfs pull for models, configure .env with a HuggingFace token, and install dependencies via pip install -r requirements.txt.
  • Execution: Start Ollama (ollama run qwen2.5:0.5b-instruct-q8_0), then run python speech_to_speech.py.
  • Hardware: Tested on an AMD Ryzen 5600G with 16GB RAM and SSD, no GPU required.
  • Demo: Video available at https://youtu.be/x92FLnwf-nA.
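
Before following the steps above, it can help to confirm that the listed prerequisites are present. The sketch below is an assumption-laden helper, not part of the repository: `missing_prerequisites` and `REQUIRED_TOOLS` are hypothetical names, and the check only looks for the `espeak-ng`, `ollama`, and `git-lfs` executables on `PATH` plus the Python version.

```python
import shutil
import sys

# Executables implied by the installation notes (assumed to be on PATH).
REQUIRED_TOOLS = ("espeak-ng", "ollama", "git-lfs")


def missing_prerequisites(version=sys.version_info, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if tuple(version[:2]) < (3, 8):
        problems.append("Python 3.8+ required")
    for tool in REQUIRED_TOOLS:
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print(problem)
```

The `version` and `which` parameters exist only so the check can be exercised without touching the real environment.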

Highlighted Details

  • Performance: Achieved ~2s latency on a CPU-only setup (Ryzen 5600G), with an average response time of ~1.5s from the end of user speech.
  • Latency Reduction: Employs priority-based text chunking (sentence breaks first, then semantic breaks) and LLM prompting with leading filler words ("umm", "so"), which together are reported to cut perceived latency by 50-70%.
  • Interruption Handling: A custom TextChunker and interrupt mechanism allow users to naturally interrupt the AI mid-response, enhancing conversational flow.
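
To make the priority-based chunking idea concrete, here is a minimal sketch of one way it could work. It is not the project's TextChunker: `split_chunk`, `chunk_stream`, the regular expressions, and the `max_len` threshold are all assumptions chosen for illustration. Sentence-ending punctuation is treated as the highest-priority break, clause punctuation is used only once the buffer grows long, and a plain space is the last resort.

```python
import re

SENTENCE_BREAK = re.compile(r"[.!?](?=\s)")  # priority 1: end of sentence
CLAUSE_BREAK = re.compile(r"[,;:](?=\s)")    # priority 2: clause boundary


def split_chunk(buffer, max_len=60):
    """Return (chunk, remainder); chunk is None when more text should be awaited."""
    match = SENTENCE_BREAK.search(buffer)
    if match:
        end = match.end()
        return buffer[:end].strip(), buffer[end:].lstrip()
    if len(buffer) >= max_len:
        clause_matches = list(CLAUSE_BREAK.finditer(buffer))
        if clause_matches:
            end = clause_matches[-1].end()  # cut at the latest clause break
            return buffer[:end].strip(), buffer[end:].lstrip()
        cut = buffer.rfind(" ")  # priority 3: any word boundary
        if cut > 0:
            return buffer[:cut].strip(), buffer[cut:].lstrip()
    return None, buffer


def chunk_stream(tokens, max_len=60):
    """Yield speakable chunks as LLM tokens arrive, flushing leftovers at the end."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            chunk, buffer = split_chunk(buffer, max_len)
            if chunk is None:
                break
            if chunk:
                yield chunk
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    for piece in chunk_stream(["Umm, ", "sure. ", "The pipeline ", "has five stages"]):
        print(piece)
```

Emitting the short opening chunk as soon as its sentence ends lets synthesis start while the rest of the reply is still being generated, which is how chunking of this kind can lower perceived latency.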

Maintenance & Community

The provided README does not detail specific contributors, community channels (e.g., Discord, Slack), or a public roadmap.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. This lack of clarity presents a significant barrier for commercial use or integration into closed-source projects.

Limitations & Caveats

The system is optimized and tested for CPU-only execution; the README notes that GPU utilization would likely yield substantial performance gains. The project appears to be a personal implementation, with results reported for the author's own machine ("in my test system"). The absence of a clear license is a critical adoption blocker.

Health Check

  • Last Commit: 6 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days
