Python app for voice/video interaction with Gemini 2.0
This project provides a real-time, multimodal conversational interface with Google's Gemini 2.0 API, enabling voice, video, and screen sharing interactions. It targets developers and power users looking to build interactive AI agents with rich media capabilities, leveraging the currently free Gemini API for immediate experimentation.
How It Works
The application sends multimodal input (voice, camera video, screen share) to Gemini 2.0 through the Gemini API and plays back the audio responses it generates. Camera and screen data are streamed in real time. Users can configure the system prompt, input mode, and output voice, and can enable or disable features such as Google Search and interruptions.
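The user-configurable options listed above could be grouped into a single settings object. The following is a minimal sketch only: the names SessionSettings, InputMode, and to_payload, the field names, and the default values are all hypothetical and are not taken from the project or from the Gemini API.

```python
from dataclasses import dataclass, asdict
from enum import Enum


class InputMode(str, Enum):
    """Hypothetical labels for the three input sources the summary mentions."""
    VOICE = "voice"
    CAMERA = "camera"
    SCREEN = "screen"


@dataclass(frozen=True)
class SessionSettings:
    """Illustrative container for the options a user can set (names assumed)."""
    system_prompt: str = "You are a helpful assistant."
    input_mode: InputMode = InputMode.VOICE
    voice_name: str = "default"  # placeholder, not a real voice identifier
    enable_google_search: bool = False
    allow_interruptions: bool = True

    def to_payload(self) -> dict:
        """Flatten the settings into plain, JSON-friendly values."""
        payload = asdict(self)
        payload["input_mode"] = self.input_mode.value
        return payload


# Example: share the screen and turn interruptions off.
settings = SessionSettings(input_mode=InputMode.SCREEN, allow_interruptions=False)
print(settings.to_payload())
```

Keeping the options in one immutable object makes it easy to pass the same configuration to whichever entry point is started, although how the project actually stores these settings is not stated in the summary.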
Quick Start & Requirements
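Backend: pip install -r requirements.txt and python backend/main.py
Frontend: npm install and npm run dev
Standalone: pip install -r requirements.txt and python standalone.py
System dependencies: installed with the system package manager (apt or dnf on Linux).
Configuration: a .env file with the Gemini API key.
Highlighted Details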
Maintenance & Community
No specific information on contributors, sponsorships, or community channels (Discord/Slack) is provided in the README.
Licensing & Compatibility
The repository does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The project notes a potential audio feedback loop issue if the microphone picks up the AI's audio output, suggesting disabling interruptions or using headphones. The "free for now" status of the Gemini API implies potential future costs.
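One common software-side way to avoid the feedback loop described above is a half-duplex gate that drops microphone chunks while assistant audio is playing. This is a swapped-in illustration, not the project's own mitigation (the summary only mentions disabling interruptions or using headphones), and the MicGate class, its methods, and the tail_seconds parameter are hypothetical. Timestamps are passed in explicitly so the sketch runs without audio hardware.

```python
class MicGate:
    """Suppresses microphone input while assistant audio is playing (assumed design)."""

    def __init__(self, tail_seconds: float = 0.3):
        # Extra quiet period after playback ends, to let room echo die down.
        self.tail_seconds = tail_seconds
        # No playback has happened yet, so the gate starts open.
        self._playback_ends_at = float("-inf")

    def note_playback(self, now: float, duration: float) -> None:
        """Record that an assistant audio chunk of `duration` seconds was queued at `now`."""
        # Chunks queued during playback are assumed to play back to back.
        start = max(now, self._playback_ends_at)
        self._playback_ends_at = start + duration

    def should_send(self, now: float) -> bool:
        """Return True when a microphone chunk captured at `now` may be forwarded."""
        return now >= self._playback_ends_at + self.tail_seconds


gate = MicGate(tail_seconds=0.25)
gate.note_playback(now=10.0, duration=1.0)
for t in (10.5, 11.0, 11.5):
    print(t, gate.should_send(t))
```

The trade-off is that the user cannot interrupt the assistant while it is speaking, which matches the effect of disabling interruptions; headphones remain the simpler fix when interruptions are wanted.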