Open-source implementation of Gemini, Google's multimodal model
This repository provides an open-source implementation of Google's Gemini model, aiming to replicate its multimodal capabilities for processing text, images, and audio. It's targeted at researchers and developers interested in building and experimenting with advanced multimodal AI architectures, offering a foundation for creating models that can understand and generate content across different data types.
How It Works
The core of the implementation is a transformer architecture that directly ingests tokenized inputs from various modalities. Image embeddings are fed directly into the transformer, bypassing a separate visual transformer encoder, similar to Fuyu's architecture but extended for multimodality. Special tokens denote modality switches within the input sequence. The model supports optimized components like Flash Attention and Query-Key normalization for improved performance.
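The direct-ingestion idea can be sketched in a few lines: image patch embeddings are spliced straight into the token stream, with special sentinel tokens marking each modality switch, so a single transformer sees one interleaved sequence. The sentinel ids and helper function below are hypothetical illustrations of the technique, not the library's actual API.

```python
# Hypothetical sketch of Fuyu-style input assembly: image patch tokens are
# spliced directly into the text token stream, with sentinel ids marking
# where a modality switch occurs. All ids and names here are illustrative.

IMG_BEGIN, IMG_END = -1, -2  # assumed sentinel ids for modality switches


def assemble_sequence(text_ids, image_patch_ids):
    """Insert image patch ids into the text stream, bracketed by sentinels.

    Here the image tokens are placed after the first text token (e.g. a
    BOS token); a real implementation would place them wherever the image
    appears in the prompt.
    """
    return text_ids[:1] + [IMG_BEGIN] + image_patch_ids + [IMG_END] + text_ids[1:]


seq = assemble_sequence([101, 7592, 102], [900, 901, 902])
print(seq)  # [101, -1, 900, 901, 902, -2, 7592, 102]
```

The transformer then attends over this single sequence, which is why no separate visual encoder is needed: the projection of image patches into the embedding space is the only image-specific step.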
Quick Start & Requirements
pip3 install gemini-torch
Highlighted Details
- LongGemini: a variant with Ring Attention for extended sequence lengths.
- MultimodalSentencePieceTokenizer: compatible with LLaMA's tokenizer.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project states that full image, audio, and video processing still requires further implementation. The provided code examples use significantly reduced model dimensions for demonstration purposes, and full-scale training is not documented.