parlor by fikrikarim

On-device, real-time multimodal AI for natural conversations

Created 6 days ago

1,323 stars

Top 29.9% on SourcePulse

View on GitHub
Project Summary

Parlor offers on-device, real-time multimodal AI, enabling natural voice and vision conversations that run entirely locally. Aimed at users who want privacy-focused AI interactions, and particularly useful for language learners, it eliminates server costs and makes advanced AI accessible on personal hardware, with mobile deployment envisioned for the future.

How It Works

The system employs a browser-based frontend capturing microphone and camera input, transmitting audio (PCM) and video (JPEG) via WebSockets to a FastAPI server. This server leverages Gemma 4 E2B (via LiteRT-LM on GPU) for speech and vision understanding, and Kokoro TTS (using MLX on macOS or ONNX on Linux) for speech synthesis. Browser-side Voice Activity Detection (Silero VAD) enables hands-free operation and barge-in capabilities, while sentence-level TTS streaming ensures low-latency audio playback. This architecture provides real-time, natural interaction without relying on external servers.
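The frontend multiplexes PCM audio and JPEG video over WebSockets as described above. A minimal sketch of one way such messages might be framed on a single socket; the tag values, layout, and function names here are illustrative assumptions, not parlor's actual wire protocol:

```python
import struct

# Hypothetical framing (not parlor's actual protocol): a 1-byte type
# tag followed by the raw payload, so one WebSocket can carry both
# audio chunks and camera frames.
MSG_AUDIO_PCM = 0x01   # chunk of 16-bit mono PCM samples
MSG_VIDEO_JPEG = 0x02  # one JPEG-encoded camera frame

def pack_frame(msg_type: int, payload: bytes) -> bytes:
    """Prefix the payload with its type tag."""
    return struct.pack("B", msg_type) + payload

def unpack_frame(frame: bytes) -> tuple[int, bytes]:
    """Split a received binary frame back into (type, payload)."""
    return frame[0], frame[1:]

# Example: a 20 ms chunk of 16 kHz mono 16-bit PCM is 320 samples = 640 bytes.
pcm_chunk = b"\x00\x00" * 320
frame = pack_frame(MSG_AUDIO_PCM, pcm_chunk)
kind, payload = unpack_frame(frame)
```

On the server side, a FastAPI WebSocket handler would receive each binary frame, dispatch on the tag, and feed audio to the speech pipeline and frames to the vision model.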

Quick Start & Requirements

  • Primary install/run commands:
    git clone https://github.com/fikrikarim/parlor.git
    cd parlor
    # Install uv if needed: curl -LsSf https://astral.sh/uv/install.sh | sh
    cd src
    uv sync
    uv run server.py
    
    Access the application at http://localhost:8000.
  • Non-default prerequisites: Python 3.12+, macOS with Apple Silicon or Linux with a supported GPU.
  • Resource footprint: Requires approximately 3 GB free RAM for the model. Models are downloaded automatically on first run (~2.6 GB for Gemma 4 E2B, plus TTS models).

Highlighted Details

  • Performance (Apple M3 Pro): Achieves end-to-end interaction times of ~2.5-3.0 seconds, with speech/vision understanding at ~1.8-2.2s and a decode speed of ~83 tokens/sec on GPU.
  • Multimodal Interaction: Facilitates natural conversations by processing both voice input and live camera feeds.
  • Real-time Features: Includes browser-based Voice Activity Detection for hands-free use and barge-in, alongside sentence-level TTS streaming for immediate audio feedback.
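The sentence-level TTS streaming mentioned above can be sketched as a splitter that flushes each completed sentence to the synthesizer while the language model is still generating the next one. The function and regex below are illustrative assumptions, not parlor's actual implementation:

```python
import re

# Sentence boundary: terminal punctuation followed by whitespace
# (a simplification; real splitters handle abbreviations, etc.).
SENTENCE_END = re.compile(r"([.!?])\s")

def stream_sentences(token_stream):
    """Yield complete sentences as soon as they appear in a token
    stream, so TTS can speak sentence 1 while sentence 2 is still
    being generated."""
    buf = ""
    for token in token_stream:
        buf += token
        while (m := SENTENCE_END.search(buf)):
            yield buf[:m.end(1)].strip()  # emit up to the punctuation
            buf = buf[m.end():]           # keep the remainder
    if buf.strip():                       # flush any trailing partial sentence
        yield buf.strip()

tokens = ["Hello", " there.", " How", " are you", "?", " Fine."]
sentences = list(stream_sentences(tokens))
# sentences == ["Hello there.", "How are you?", "Fine."]
```

Feeding each yielded sentence to TTS as it arrives is what keeps first-audio latency low: playback starts after the first sentence rather than after the full response.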

Maintenance & Community

The provided README does not detail specific contributors, sponsorships, partnerships, community channels (e.g., Discord/Slack), or a roadmap.

Licensing & Compatibility

  • License type: Apache 2.0.
  • Compatibility notes: The Apache 2.0 license is permissive, generally allowing for commercial use and integration into closed-source projects.

Limitations & Caveats

This project is presented as a "research preview" and an "early experiment," with users advised to expect "rough edges and bugs." While not suitable for tasks like agentic coding, it is highlighted as a valuable tool for language learning.

Health Check

  • Last commit: 5 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 9
  • Issues (30d): 6
  • Star history: 1,332 stars in the last 6 days
