parlor by fikrikarim

On-device, real-time multimodal AI for natural conversations

Created 6 days ago

1,323 stars

Top 29.9% on SourcePulse

View on GitHub
Project Summary

Parlor offers on-device, real-time multimodal AI, enabling natural voice and vision conversations that run entirely locally. Aimed at users who want privacy-focused AI interactions, and particularly useful for language learners, it eliminates server costs and makes advanced AI accessible on personal hardware, with mobile deployment envisioned for the future.

How It Works

The system employs a browser-based frontend capturing microphone and camera input, transmitting audio (PCM) and video (JPEG) via WebSockets to a FastAPI server. This server leverages Gemma 4 E2B (via LiteRT-LM on GPU) for speech and vision understanding, and Kokoro TTS (using MLX on macOS or ONNX on Linux) for speech synthesis. Browser-side Voice Activity Detection (Silero VAD) enables hands-free operation and barge-in capabilities, while sentence-level TTS streaming ensures low-latency audio playback. This architecture provides real-time, natural interaction without relying on external servers.
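The frontend multiplexes PCM audio and JPEG video over WebSockets as described above. A minimal sketch of one way such messages might be framed on a single socket; the tag values, layout, and function names here are illustrative assumptions, not parlor's actual wire protocol:

```python
import struct

# Hypothetical framing (not parlor's actual protocol): a 1-byte type
# tag followed by the raw payload, so one WebSocket can carry both
# audio chunks and camera frames.
MSG_AUDIO_PCM = 0x01   # chunk of 16-bit mono PCM samples
MSG_VIDEO_JPEG = 0x02  # one JPEG-encoded camera frame

def pack_frame(msg_type: int, payload: bytes) -> bytes:
    """Prefix the payload with its type tag."""
    return struct.pack("B", msg_type) + payload

def unpack_frame(frame: bytes) -> tuple[int, bytes]:
    """Split a received binary frame back into (type, payload)."""
    return frame[0], frame[1:]

# Example: a 20 ms chunk of 16 kHz mono 16-bit PCM is 320 samples = 640 bytes.
pcm_chunk = b"\x00\x00" * 320
frame = pack_frame(MSG_AUDIO_PCM, pcm_chunk)
kind, payload = unpack_frame(frame)
```

On the server side, a FastAPI WebSocket handler would receive each binary frame, dispatch on the tag, and feed audio to the speech pipeline and frames to the vision model.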

Quick Start & Requirements

  • Primary install/run commands:
    git clone https://github.com/fikrikarim/parlor.git
    cd parlor
    # Install uv if needed: curl -LsSf https://astral.sh/uv/install.sh | sh
    cd src
    uv sync
    uv run server.py
    
    Access the application at http://localhost:8000.
  • Non-default prerequisites: Python 3.12+, macOS with Apple Silicon or Linux with a supported GPU.
  • Resource footprint: Requires approximately 3 GB free RAM for the model. Models are downloaded automatically on first run (~2.6 GB for Gemma 4 E2B, plus TTS models).

Highlighted Details

  • Performance (Apple M3 Pro): Achieves end-to-end interaction times of ~2.5-3.0 seconds, with speech/vision understanding at ~1.8-2.2s and a decode speed of ~83 tokens/sec on GPU.
  • Multimodal Interaction: Facilitates natural conversations by processing both voice input and live camera feeds.
  • Real-time Features: Includes browser-based Voice Activity Detection for hands-free use and barge-in, alongside sentence-level TTS streaming for immediate audio feedback.
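The sentence-level TTS streaming mentioned above can be sketched as a splitter that flushes each completed sentence to the synthesizer while the language model is still generating the next one. The function and regex below are illustrative assumptions, not parlor's actual implementation:

```python
import re

# Sentence boundary: terminal punctuation followed by whitespace
# (a simplification; real splitters handle abbreviations, etc.).
SENTENCE_END = re.compile(r"([.!?])\s")

def stream_sentences(token_stream):
    """Yield complete sentences as soon as they appear in a token
    stream, so TTS can speak sentence 1 while sentence 2 is still
    being generated."""
    buf = ""
    for token in token_stream:
        buf += token
        while (m := SENTENCE_END.search(buf)):
            yield buf[:m.end(1)].strip()  # emit up to the punctuation
            buf = buf[m.end():]           # keep the remainder
    if buf.strip():                       # flush any trailing partial sentence
        yield buf.strip()

tokens = ["Hello", " there.", " How", " are you", "?", " Fine."]
sentences = list(stream_sentences(tokens))
# sentences == ["Hello there.", "How are you?", "Fine."]
```

Feeding each yielded sentence to TTS as it arrives is what keeps first-audio latency low: playback starts after the first sentence rather than after the full response.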

Maintenance & Community

The provided README does not detail specific contributors, sponsorships, partnerships, community channels (e.g., Discord/Slack), or a roadmap.

Licensing & Compatibility

  • License type: Apache 2.0.
  • Compatibility notes: The Apache 2.0 license is permissive, generally allowing for commercial use and integration into closed-source projects.

Limitations & Caveats

This project is presented as a "research preview" and an "early experiment," with users advised to expect "rough edges and bugs." While not suitable for tasks like agentic coding, it is highlighted as a valuable tool for language learning.

Health Check

  • Last commit: 5 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 9
  • Issues (30d): 6
  • Star history: 1,332 stars in the last 6 days
