Vocalis by Lex-au

AI speech-to-speech assistant enabling natural, multimodal conversations

Created 7 months ago
255 stars

Top 98.8% on SourcePulse

Project Summary

Vocalis is a sophisticated speech-to-speech AI assistant designed for natural, low-latency conversations. It targets users seeking advanced conversational AI with features like mid-speech interruption, AI-initiated follow-ups, and multimodal capabilities (including image analysis), offering a highly responsive and customizable experience that can leverage local LLM and TTS services.

How It Works

Vocalis employs a modern React frontend and FastAPI backend architecture to deliver a responsive, low-latency conversational experience. Its core innovation lies in its "barge-in" technology, allowing users to interrupt the AI mid-speech for natural flow. It supports AI-initiated greetings and follow-ups, dynamic visual feedback, and integrates with local LLM/TTS services via OpenAI-compatible endpoints, enabling users to run powerful AI assistants entirely offline. The system utilizes Faster-Whisper for ASR, custom VAD, and streaming TTS for immediate audio playback, with optional CUDA acceleration.
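The barge-in flow described above can be modeled as a small state machine: while the assistant streams TTS audio, a VAD callback watches the microphone, and any detected user speech cancels playback and hands the turn back to the listener. A minimal sketch of that idea (class and method names are illustrative, not Vocalis's actual API):

```python
from enum import Enum, auto


class State(Enum):
    LISTENING = auto()
    SPEAKING = auto()


class BargeInSession:
    """Toy model of barge-in: user speech interrupts AI playback."""

    def __init__(self):
        self.state = State.LISTENING
        self.playback_cancelled = False

    def start_speaking(self):
        # Assistant begins streaming TTS audio to the client.
        self.state = State.SPEAKING
        self.playback_cancelled = False

    def on_vad_speech_detected(self):
        # VAD fired while the assistant is talking: cancel playback
        # immediately and return control to the user.
        if self.state is State.SPEAKING:
            self.playback_cancelled = True
            self.state = State.LISTENING


session = BargeInSession()
session.start_speaking()
session.on_vad_speech_detected()
print(session.state.name)            # LISTENING
print(session.playback_cancelled)    # True
```

The real system drives these transitions from streaming audio events rather than direct method calls, but the core invariant is the same: detected user speech always wins over in-progress playback.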

Quick Start & Requirements

  • Installation: Recommended one-click setup via setup.bat (Windows) or ./setup.sh (macOS/Linux). Manual setup involves creating Python virtual environments and running pip install -r requirements.txt for the backend, and npm install for the frontend.
  • Prerequisites:
    • Windows: Python 3.10+ (in PATH), Node.js, npm.
    • macOS: Python 3.10+, Homebrew, Node.js, npm.
    • Optional: CUDA for GPU acceleration (PyTorch installation handles this).
  • Resource Footprint: Requires local LLM and TTS services to be running. Specific resource needs depend on the chosen local models.
  • Documentation: Links to video demonstration, changelog, and project structure are provided within the README.
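Because Vocalis reaches local LLM services through OpenAI-compatible endpoints, pointing it at a server such as LM Studio amounts to supplying a base URL and a standard chat-completions payload. A hedged sketch of such a request (the URL, port, and model name are placeholder assumptions, not values from the Vocalis docs):

```python
import json

# Placeholder values: LM Studio commonly serves on port 1234, but the
# actual base URL and model name depend on your local setup.
BASE_URL = "http://localhost:1234/v1"

payload = {
    "model": "local-model",   # whatever model the local server has loaded
    "stream": True,           # stream tokens for a low-latency TTS handoff
    "messages": [
        {"role": "system", "content": "You are a helpful voice assistant."},
        {"role": "user", "content": "Summarize what barge-in means."},
    ],
}

# The request body an OpenAI-compatible server expects at /chat/completions:
body = json.dumps(payload)
print(f"POST {BASE_URL}/chat/completions")
print(body[:60] + "...")
```

Any backend that speaks this wire format (LM Studio, Orpheus-FASTAPI, and similar) can be swapped in without code changes, which is what lets the assistant run fully offline.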

Highlighted Details

  • Barge-In Technology: Enables users to interrupt the AI mid-speech for a natural conversational flow.
  • AI-Initiated Interactions: Features AI-initiated greetings and intelligent follow-up questions during silences.
  • Multimodal Support: Includes image analysis via SmolVLM-256M-Instruct, allowing discussions about uploaded images.
  • Low-Latency Streaming: Achieves end-to-end latency under 500ms with streaming TTS and adaptive buffering.
  • Local Service Integration: Works with local LLM/TTS services (e.g., LM Studio, Orpheus-FASTAPI) via OpenAI-compatible endpoints.
  • Session Management: Robust system for saving, loading, and organizing conversations.
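The sub-500ms latency figure rests on streaming TTS with adaptive buffering: playback starts as soon as a small initial buffer of audio chunks arrives, rather than waiting for the full clip to be synthesized. A simplified sketch of that pattern (the buffer threshold and chunk contents are illustrative):

```python
def stream_playback(chunks, min_buffer_chunks=3):
    """Yield audio chunks for playback, starting once a small initial
    buffer is filled instead of waiting for the whole clip."""
    buffer = []
    started = False
    for chunk in chunks:
        buffer.append(chunk)
        # Start playback early, once enough audio is queued to absorb
        # network jitter without under-running.
        if not started and len(buffer) >= min_buffer_chunks:
            started = True
        if started:
            while buffer:
                yield buffer.pop(0)
    # Flush anything left if the stream ended before the threshold.
    yield from buffer


# Simulated TTS chunks arriving from the synthesis server:
played = list(stream_playback([b"c1", b"c2", b"c3", b"c4", b"c5"]))
print(played)  # all five chunks, in arrival order
```

A real implementation would tune the threshold to measured chunk duration and network jitter; the trade-off is the usual one between startup latency and the risk of playback under-runs.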

Maintenance & Community

No specific details regarding maintainers, community channels (like Discord/Slack), or roadmap were found in the provided text.

Licensing & Compatibility

  • License: Apache License 2.0.
  • Compatibility: Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The system depends on the user setting up and running compatible local LLM and TTS services, which adds complexity for those unfamiliar with these tools. CUDA acceleration is supported, but optimal performance may require specific GPU hardware.

Health Check

  • Last Commit: 6 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 9 stars in the last 30 days
