gemini-multimodal-playground by saharmor

Python app for voice/video interaction with Gemini 2.0

Created 1 year ago

319 stars

Top 85.0% on SourcePulse

Project Summary

This project provides a real-time, multimodal conversational interface with Google's Gemini 2.0 API, enabling voice, video, and screen sharing interactions. It targets developers and power users looking to build interactive AI agents with rich media capabilities, leveraging the currently free Gemini API for immediate experimentation.

How It Works

The application uses Gemini 2.0 to process multimodal inputs (voice, video, and screen share) and to generate audio responses. Camera and screen data are streamed to the Gemini API in real time as part of the conversation. Users can configure the system prompt, input mode, and output voice, and can enable or disable features such as Google Search and interruptions.
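The configurable options described above can be pictured as a single settings object. The following is a minimal sketch only, not the project's actual code: the class name SessionSettings, the InputMode values, and every field name and default are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, asdict
from enum import Enum


class InputMode(str, Enum):
    """Hypothetical input modes mirroring voice, camera, and screen share."""
    AUDIO_ONLY = "audio"
    CAMERA = "camera"
    SCREEN = "screen"


@dataclass(frozen=True)
class SessionSettings:
    """Illustrative container for the user-configurable options."""
    system_prompt: str = "You are a helpful assistant."
    input_mode: InputMode = InputMode.AUDIO_ONLY
    voice: str = "default"  # placeholder, not a real Gemini voice name
    enable_google_search: bool = False
    allow_interruptions: bool = True

    def streams_video(self) -> bool:
        # Camera and screen share both send image frames alongside audio.
        return self.input_mode in (InputMode.CAMERA, InputMode.SCREEN)


settings = SessionSettings(input_mode=InputMode.SCREEN, allow_interruptions=False)
print(settings.streams_video())
print(asdict(settings)["enable_google_search"])
```

Keeping the options in one immutable object makes it straightforward to pass the same configuration to both a web backend and a standalone app, should a project choose that design.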

Quick Start & Requirements

Backend: run pip install -r requirements.txt, then python backend/main.py
Frontend: run npm install, then npm run dev
Standalone: run pip install -r requirements.txt, then python standalone.py
Prerequisites: Python 3.12+, Node.js 18+, a Google Cloud account, and a Gemini API key. The standalone version also needs Tkinter (included with Python on macOS and Windows; installable via apt or dnf on Linux).
Setup: Clone the repository, create virtual environments, install dependencies, and configure a .env file with the Gemini API key.
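Before starting the backend or the standalone app, a small preflight check can confirm the prerequisites listed above. The sketch below is an assumption-laden illustration, not part of the repository: the variable name GEMINI_API_KEY and the helper names are hypothetical, and the parser handles only simple KEY=VALUE lines.

```python
import sys


def python_is_supported(version=sys.version_info, minimum=(3, 12)):
    """Return True when the interpreter meets the stated Python 3.12+ requirement."""
    return tuple(version[:2]) >= minimum


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip surrounding whitespace and optional quotes from the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_api_key(env, name="GEMINI_API_KEY"):
    """Return the key or fail early; the variable name is an assumption."""
    key = env.get(name, "")
    if not key:
        raise RuntimeError(f"{name} is missing from .env")
    return key


example = parse_env('# local secrets\nGEMINI_API_KEY="abc123"\n')
print(require_api_key(example))  # prints abc123
```

In practice the text would come from reading the .env file in the repository root; check the project's README for the variable name it actually expects.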

Highlighted Details

Real-time voice, video, and screen sharing input.
Gemini 2.0 API integration (currently free).
Configurable system prompts, input modes, and voice outputs.
Option to enable Google Search and allow interruptions.

Maintenance & Community

No specific information on contributors, sponsorships, or community channels (Discord/Slack) is provided in the README.

Licensing & Compatibility

The repository does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project warns of a potential audio feedback loop when the microphone picks up the AI's audio output, and suggests disabling interruptions or using headphones. The Gemini API is described as free for now, so usage may incur costs in the future.
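One way to picture why disabling interruptions avoids the feedback loop: while the assistant's audio is playing, microphone chunks are simply not forwarded, so the model never hears its own voice. The sketch below illustrates that idea under stated assumptions; the MicGate class and its methods are hypothetical and do not describe how the project itself handles audio.

```python
class MicGate:
    """Drops microphone chunks while the assistant's audio is playing."""

    def __init__(self, allow_interruptions):
        self.allow_interruptions = allow_interruptions
        self.assistant_speaking = False

    def set_assistant_speaking(self, speaking):
        # Called by the (hypothetical) playback code when audio starts or stops.
        self.assistant_speaking = speaking

    def filter(self, chunk):
        """Return the chunk to forward, or None to drop it."""
        if self.assistant_speaking and not self.allow_interruptions:
            return None
        return chunk


gate = MicGate(allow_interruptions=False)
gate.set_assistant_speaking(True)
print(gate.filter(b"\x00\x01"))  # prints None
gate.set_assistant_speaking(False)
print(gate.filter(b"\x00\x01"))  # prints b'\x00\x01'
```

With interruptions allowed, the gate forwards everything, which is why headphones are the suggested workaround in that mode.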


Explore Similar Projects

gemini-cursor by 13point5

GPTPortal by Zaki-1052

sagittarius by gregsadetsky

gemini-api-quickstart by google-gemini

swift-realtime-openai by m1guelpf

hertz-dev by Standard-Intelligence

ada by Nlouis38

All-Model-Chat by yeahhe365

gemini-2-live-api-demo by ViaAnthroposBenevolentia

fastrtc by gradio-app

Bard-API by dsdanielpark

livehelperchat by LiveHelperChat