Vocalis by Lex-au

AI speech-to-speech assistant enabling natural, multimodal conversations

Created 7 months ago
255 stars

Top 98.8% on SourcePulse

Project Summary

Vocalis is a sophisticated speech-to-speech AI assistant designed for natural, low-latency conversations. It targets users seeking advanced conversational AI with features like mid-speech interruption, AI-initiated follow-ups, and multimodal capabilities (including image analysis), offering a highly responsive and customizable experience that can leverage local LLM and TTS services.

How It Works

Vocalis employs a modern React frontend and FastAPI backend architecture to deliver a responsive, low-latency conversational experience. Its core innovation lies in its "barge-in" technology, allowing users to interrupt the AI mid-speech for natural flow. It supports AI-initiated greetings and follow-ups, dynamic visual feedback, and integrates with local LLM/TTS services via OpenAI-compatible endpoints, enabling users to run powerful AI assistants entirely offline. The system utilizes Faster-Whisper for ASR, custom VAD, and streaming TTS for immediate audio playback, with optional CUDA acceleration.
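The barge-in flow described above can be modeled as a small state machine: while the assistant streams TTS audio, a VAD callback watches the microphone, and any detected user speech cancels playback and hands the turn back to the listener. A minimal sketch of that idea (class and method names are illustrative, not Vocalis's actual API):

```python
from enum import Enum, auto


class State(Enum):
    LISTENING = auto()
    SPEAKING = auto()


class BargeInSession:
    """Toy model of barge-in: user speech interrupts AI playback."""

    def __init__(self):
        self.state = State.LISTENING
        self.playback_cancelled = False

    def start_speaking(self):
        # Assistant begins streaming TTS audio to the client.
        self.state = State.SPEAKING
        self.playback_cancelled = False

    def on_vad_speech_detected(self):
        # VAD fired while the assistant is talking: cancel playback
        # immediately and return control to the user.
        if self.state is State.SPEAKING:
            self.playback_cancelled = True
            self.state = State.LISTENING


session = BargeInSession()
session.start_speaking()
session.on_vad_speech_detected()
print(session.state.name)            # LISTENING
print(session.playback_cancelled)    # True
```

The real system drives these transitions from streaming audio events rather than direct method calls, but the core invariant is the same: detected user speech always wins over in-progress playback.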

Quick Start & Requirements

  • Installation: Recommended one-click setup via setup.bat (Windows) or ./setup.sh (macOS/Linux). Manual setup involves creating Python virtual environments and running pip install -r requirements.txt for the backend, and npm install for the frontend.
  • Prerequisites:
    • Windows: Python 3.10+ (in PATH), Node.js, npm.
    • macOS: Python 3.10+, Homebrew, Node.js, npm.
    • Optional: CUDA for GPU acceleration (PyTorch installation handles this).
  • Resource Footprint: Requires local LLM and TTS services to be running. Specific resource needs depend on the chosen local models.
  • Documentation: Links to video demonstration, changelog, and project structure are provided within the README.
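Because Vocalis reaches local LLM services through OpenAI-compatible endpoints, pointing it at a server such as LM Studio amounts to supplying a base URL and a standard chat-completions payload. A hedged sketch of such a request (the URL, port, and model name are placeholder assumptions, not values from the Vocalis docs):

```python
import json

# Placeholder values: LM Studio commonly serves on port 1234, but the
# actual base URL and model name depend on your local setup.
BASE_URL = "http://localhost:1234/v1"

payload = {
    "model": "local-model",   # whatever model the local server has loaded
    "stream": True,           # stream tokens for a low-latency TTS handoff
    "messages": [
        {"role": "system", "content": "You are a helpful voice assistant."},
        {"role": "user", "content": "Summarize what barge-in means."},
    ],
}

# The request body an OpenAI-compatible server expects at /chat/completions:
body = json.dumps(payload)
print(f"POST {BASE_URL}/chat/completions")
print(body[:60] + "...")
```

Any backend that speaks this wire format (LM Studio, Orpheus-FASTAPI, and similar) can be swapped in without code changes, which is what lets the assistant run fully offline.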

Highlighted Details

  • Barge-In Technology: Enables users to interrupt the AI mid-speech for a natural conversational flow.
  • AI-Initiated Interactions: Features AI-initiated greetings and intelligent follow-up questions during silences.
  • Multimodal Support: Includes image analysis via SmolVLM-256M-Instruct, allowing discussions about uploaded images.
  • Low-Latency Streaming: Achieves end-to-end latency under 500ms with streaming TTS and adaptive buffering.
  • Local Service Integration: Works with local LLM/TTS services (e.g., LM Studio, Orpheus-FASTAPI) via OpenAI-compatible endpoints.
  • Session Management: Robust system for saving, loading, and organizing conversations.
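The sub-500ms latency figure rests on streaming TTS with adaptive buffering: playback starts as soon as a small initial buffer of audio chunks arrives, rather than waiting for the full clip to be synthesized. A simplified sketch of that pattern (the buffer threshold and chunk contents are illustrative):

```python
def stream_playback(chunks, min_buffer_chunks=3):
    """Yield audio chunks for playback, starting once a small initial
    buffer is filled instead of waiting for the whole clip."""
    buffer = []
    started = False
    for chunk in chunks:
        buffer.append(chunk)
        # Start playback early, once enough audio is queued to absorb
        # network jitter without under-running.
        if not started and len(buffer) >= min_buffer_chunks:
            started = True
        if started:
            while buffer:
                yield buffer.pop(0)
    # Flush anything left if the stream ended before the threshold.
    yield from buffer


# Simulated TTS chunks arriving from the synthesis server:
played = list(stream_playback([b"c1", b"c2", b"c3", b"c4", b"c5"]))
print(played)  # all five chunks, in arrival order
```

A real implementation would tune the threshold to measured chunk duration and network jitter; the trade-off is the usual one between startup latency and the risk of playback under-runs.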

Maintenance & Community

No specific details regarding maintainers, community channels (like Discord/Slack), or roadmap were found in the provided text.

Licensing & Compatibility

  • License: Apache License 2.0.
  • Compatibility: Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The system depends on the user setting up and running compatible local LLM and TTS services, which adds complexity for those unfamiliar with these tools. CUDA acceleration is supported, but optimal performance may require specific GPU hardware.

Health Check

  • Last Commit: 6 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 9 stars in the last 30 days
