gemini-multimodal-playground  by saharmor

Python app for voice/video interaction with Gemini 2.0

created 7 months ago
297 stars

Top 90.4% on sourcepulse

Project Summary

This project provides a real-time, multimodal conversational interface to Google's Gemini 2.0 API, supporting voice, video, and screen-sharing interactions. It targets developers and power users who want to build interactive AI agents with rich media capabilities, and it relies on the Gemini API, which is currently free, for immediate experimentation.

How It Works

The application sends multimodal input (voice, video, screen share) to Gemini 2.0, which generates audio responses. Camera and screen data are streamed to the Gemini API in real time. Users can configure the system prompt, input mode, and output voice, and can toggle features such as Google Search and interruptions.
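The configurable options described above could be grouped into a single settings object. The following is only a sketch: the class name, field names, defaults, and the list of input modes are all hypothetical and are not taken from the repository.

```python
from dataclasses import dataclass

# Hypothetical option values; the project's actual settings may differ.
INPUT_MODES = ("audio", "camera", "screen")


@dataclass
class SessionSettings:
    """Illustrative container for the user-facing options."""

    system_prompt: str = "You are a helpful assistant."
    input_mode: str = "audio"            # one of INPUT_MODES
    voice: str = "default"               # placeholder voice name
    enable_google_search: bool = False
    allow_interruptions: bool = True

    def __post_init__(self) -> None:
        # Reject values the rest of the app would not know how to handle.
        if self.input_mode not in INPUT_MODES:
            raise ValueError(f"unknown input mode: {self.input_mode!r}")
        if not self.system_prompt.strip():
            raise ValueError("system prompt must not be empty")


settings = SessionSettings(input_mode="screen", enable_google_search=True)
print(settings)
```

Validating in one place keeps invalid combinations from reaching the streaming code, whatever shape that code takes in the actual project.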

Quick Start & Requirements

  • Backend: run pip install -r requirements.txt, then python backend/main.py
  • Frontend: run npm install, then npm run dev
  • Standalone: run pip install -r requirements.txt, then python standalone.py
  • Prerequisites: Python 3.12+, Node.js 18+, a Google Cloud account, and a Gemini API key. The standalone version also needs Tkinter (included with Python on macOS and Windows; installable via apt or dnf on Linux).
  • Setup: Clone the repository, set up virtual environments, install dependencies, and add the Gemini API key to a .env file.
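As an illustration of the .env step, here is a minimal standard-library sketch of reading an API key from such a file. The variable name GEMINI_API_KEY and both helper functions are assumptions made for this example; the repository may use a different variable name or a library such as python-dotenv.

```python
import os
from pathlib import Path


def load_env_file(path: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(env_path: str = ".env", name: str = "GEMINI_API_KEY") -> str:
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get(name)
    if not key and Path(env_path).is_file():
        key = load_env_file(env_path).get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; add it to {env_path}")
    return key
```

Calling get_api_key() at startup fails early with a readable message when the key is missing, rather than surfacing later as an API authentication error.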

Highlighted Details

  • Real-time voice, video, and screen sharing input.
  • Gemini 2.0 API integration (currently free).
  • Configurable system prompts, input modes, and voice outputs.
  • Option to enable Google Search and allow interruptions.

Maintenance & Community

No specific information on contributors, sponsorships, or community channels (Discord/Slack) is provided in the README.

Licensing & Compatibility

The repository does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project notes that an audio feedback loop can occur if the microphone picks up the AI's audio output; it suggests disabling interruptions or using headphones. The Gemini API is described as "free for now", which implies that it may incur costs in the future.
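One plausible reading of why disabling interruptions helps is that the microphone can then be ignored while the assistant is speaking. The sketch below illustrates that idea only; the MicGate class and its methods are hypothetical and do not describe the project's actual implementation.

```python
from typing import Optional


class MicGate:
    """Drops microphone chunks while assistant audio is playing."""

    def __init__(self, allow_interruptions: bool) -> None:
        self.allow_interruptions = allow_interruptions
        self.assistant_speaking = False

    def filter(self, chunk: bytes) -> Optional[bytes]:
        # With interruptions enabled the mic stays open during playback,
        # which is where speaker output can leak back in without headphones.
        if self.assistant_speaking and not self.allow_interruptions:
            return None
        return chunk


gate = MicGate(allow_interruptions=False)
gate.assistant_speaking = True
print(gate.filter(b"\x00\x01"))   # None: chunk is dropped during playback
gate.assistant_speaking = False
print(gate.filter(b"\x00\x01"))   # chunk passes through once playback ends
```

The trade-off is that the user cannot cut the assistant off mid-sentence, which is why headphones are the alternative when interruptions need to stay enabled.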

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull requests (30d): 1
  • Issues (30d): 1

Star History

21 stars in the last 90 days
