gemini-live-api-examples by google-gemini

Realtime multimodal agent framework for voice and video

Created 3 months ago
260 stars

Top 97.4% on SourcePulse

Project Summary

This repository provides examples for the Gemini Live API, which lets developers build multimodal, real-time voice and video agents. The API processes continuous streams of audio, video, and text, and targets applications that need low-latency, natural conversation, such as interactive e-commerce assistants, gaming NPCs, next-gen interfaces, and healthcare companions.

How It Works

The Live API accepts continuous streams of audio, video, or text over a stateful WebSocket connection (WSS), which keeps interactions with Gemini models low-latency and real-time. Spoken responses are delivered immediately, and users can interrupt the model mid-response (barge-in), so conversations feel closer to natural speech. Because the different input types are processed concurrently, a single session can act as a multimodal agent.
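On the client side, barge-in usually means discarding any model audio that is queued but not yet played as soon as an interruption is detected. The following is a minimal sketch of that idea only. The `PlaybackQueue` class and its method names are hypothetical and not taken from the repository or the Live API, and how the interruption is signaled is left out.

```python
from collections import deque


class PlaybackQueue:
    """Hypothetical client-side buffer for model audio awaiting playback."""

    def __init__(self):
        self._chunks = deque()

    def enqueue(self, pcm_chunk: bytes) -> None:
        # Called whenever a chunk of model audio arrives.
        self._chunks.append(pcm_chunk)

    def next_chunk(self):
        # Called by the audio player; returns None when nothing is queued.
        return self._chunks.popleft() if self._chunks else None

    def interrupt(self) -> int:
        # Barge-in: drop all pending audio so the model stops "talking".
        dropped = len(self._chunks)
        self._chunks.clear()
        return dropped

    def __len__(self) -> int:
        return len(self._chunks)


queue = PlaybackQueue()
for chunk in (b"aa", b"bb", b"cc"):
    queue.enqueue(chunk)

first = queue.next_chunk()     # the player consumes one chunk
dropped = queue.interrupt()    # the user starts speaking; discard the rest
```

After the interruption, the player finds the queue empty and stays silent until new audio arrives.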

Quick Start & Requirements

Examples cover three integration paths: the Gen AI SDK (Python), raw WebSocket connections (JavaScript frontend, Python backend), and minimal command-line applications (Python, Node.js). Inputs are raw 16-bit PCM audio (16 kHz, little-endian) and JPEG images or video frames at up to 1 FPS. Outputs are raw 16-bit PCM audio (24 kHz, little-endian) and text. All traffic runs over a stateful WebSocket (WSS) connection.
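To make the audio formats above concrete, here is a minimal sketch that packs floating-point samples into raw 16-bit little-endian PCM and works out how much audio a byte buffer holds. It uses only the Python standard library. The helper names (`float_to_pcm16le`, `pcm_duration_seconds`, `sine_wave`) and the 440 Hz test tone are illustrative assumptions, not code from the repository.

```python
import math
import struct

INPUT_RATE_HZ = 16_000    # input sample rate stated above
OUTPUT_RATE_HZ = 24_000   # output sample rate stated above
BYTES_PER_SAMPLE = 2      # 16-bit PCM


def float_to_pcm16le(samples):
    """Pack floats in [-1.0, 1.0] as signed 16-bit little-endian PCM."""
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    # "<" selects little-endian byte order, "h" a signed 16-bit integer.
    return struct.pack("<%dh" % len(ints), *ints)


def pcm_duration_seconds(pcm_bytes, rate_hz):
    """Duration of a mono 16-bit PCM buffer at the given sample rate."""
    return len(pcm_bytes) / (BYTES_PER_SAMPLE * rate_hz)


def sine_wave(freq_hz, num_samples, rate_hz):
    """Generate a test tone as a list of floats in [-1.0, 1.0]."""
    return [math.sin(2 * math.pi * freq_hz * i / rate_hz)
            for i in range(num_samples)]


# 1600 samples at 16 kHz is 100 ms of audio, or 3200 bytes of PCM.
chunk = float_to_pcm16le(sine_wave(440.0, 1600, INPUT_RATE_HZ))
chunk_seconds = pcm_duration_seconds(chunk, INPUT_RATE_HZ)
```

The same duration helper applies to received audio by passing `OUTPUT_RATE_HZ`, since output differs from input only in sample rate.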

Highlighted Details

Key features include extensive multilingual support (70 languages), real-time barge-in for responsive interactions, integrated tool use (function calling, Google Search), automatic audio transcription, proactive audio control, and affective dialog for adaptive response styles.

Maintenance & Community

The project showcases integrations with partner platforms and SDKs, including LiveKit, Pipecat by Daily, Fishjam by Software Mansion, Vision Agents by Stream, Voximplant, Agent Development Kit (ADK), and Firebase AI SDK. The breadth of this list suggests interest from real-time communication platforms, though the README gives no maintainer or community details.

Licensing & Compatibility

The repository's README does not specify a license. Compatibility is geared towards building real-time audio and video applications, with integrations supporting WebRTC and WebSockets.

Limitations & Caveats

No explicit limitations, alpha/beta status, or known bugs are detailed in the provided README. The examples focus on specific integration patterns and technical specifications for the Live API.

Health Check

Last Commit: 1 week ago
Responsiveness: Inactive
Pull Requests (30d): 3
Issues (30d): 2
Star History: 50 stars in the last 30 days
