Gemini by kyegomez

Open-source implementation of Gemini, Google's multimodal model

created 1 year ago
462 stars

Top 66.5% on sourcepulse

Project Summary

This repository provides an open-source implementation of Google's Gemini model, aiming to replicate its multimodal capabilities for processing text, images, and audio. It's targeted at researchers and developers interested in building and experimenting with advanced multimodal AI architectures, offering a foundation for creating models that can understand and generate content across different data types.

How It Works

The core of the implementation is a single transformer that ingests tokenized inputs from every modality in one sequence. Image embeddings are fed directly into the transformer rather than through a separate vision transformer encoder, an approach similar to Fuyu's architecture but extended to further modalities such as audio. Special tokens mark the modality switches within the input sequence. The model also supports Flash Attention for faster, more memory-efficient attention and Query-Key normalization (QK Norm) for more stable training.
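To make the single-sequence idea concrete, here is a minimal pure-Python sketch of how text tokens and image patches might be placed in one input sequence. It is an illustration under stated assumptions, not the repository's code: the marker strings `<img>` and `</img>`, the helper names `patchify`, `linear`, and `build_sequence`, and the random weights standing in for learned parameters are all hypothetical.

```python
# Illustrative sketch only: token strings, helper names, and sizes are
# assumptions and are not taken from the repository.
import random

IMG_START, IMG_END = "<img>", "</img>"  # hypothetical modality-switch tokens


def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patches."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            flat = [image[top + r][left + c]
                    for r in range(patch) for c in range(patch)]
            patches.append(flat)
    return patches


def linear(vec, weight):
    """Project a vector with a (len(vec) x dim) weight matrix."""
    dim = len(weight[0])
    return [sum(v * weight[i][j] for i, v in enumerate(vec)) for j in range(dim)]


def build_sequence(text_tokens, image, patch, dim, seed=0):
    """Return one list of dim-sized vectors: text, <img>, patches, </img>."""
    rng = random.Random(seed)
    # Random stand-ins for learned parameters.
    vocab = sorted(set(text_tokens) | {IMG_START, IMG_END})
    embed = {tok: [rng.gauss(0, 1) for _ in range(dim)] for tok in vocab}
    w_patch = [[rng.gauss(0, 1) for _ in range(dim)]
               for _ in range(patch * patch)]

    seq = [embed[t] for t in text_tokens]
    seq.append(embed[IMG_START])           # switch into the image modality
    # Raw patches are projected linearly; no separate vision encoder is used.
    seq.extend(linear(p, w_patch) for p in patchify(image, patch))
    seq.append(embed[IMG_END])             # switch back to text
    return seq


image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]  # 4x4 toy image
sequence = build_sequence(["describe", "this"], image, patch=2, dim=8)
print(len(sequence), len(sequence[0]))
```

Every element of `sequence` has the same width, so a decoder-only transformer could consume it without knowing which positions came from which modality beyond what the marker tokens signal.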

Quick Start & Requirements

Highlighted Details

  • Implements a transformer that natively handles interleaved text, image, and audio inputs.
  • Utilizes optimized components like Flash Attention, RoPE, ALiBi, and QK Norm.
  • Includes a LongGemini variant with Ring Attention for extended sequence lengths.
  • Features a MultimodalSentencePieceTokenizer compatible with LLaMA's tokenizer.
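As a rough illustration of one component named above, the sketch below shows a common formulation of QK Norm: queries and keys are L2-normalized before the dot product, and the result is multiplied by a scale factor. The fixed `scale=10.0` and the function names are assumptions made for this example; the repository's variant may differ, for instance by learning the scale.

```python
# Illustrative QK Norm sketch; the scale value and names are assumptions.
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def qk_norm_attention(queries, keys, values, scale=10.0):
    """Single-head attention with L2-normalized queries and keys."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    out = []
    for qi in q:
        # Cosine similarity times scale, so every logit lies in [-scale, scale].
        logits = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        weights = softmax(logits)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


q = [[1.0, 2.0], [0.5, -1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0]]
attended = qk_norm_attention(q, k, v)
```

Because the logits depend only on the direction of each query and key, multiplying the queries by a large constant leaves the output unchanged, which is the property that keeps attention logits bounded during training.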

Maintenance & Community

  • Community support is available via the Agora Discord channel.
  • A project board is linked for tracking development tasks.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README.

Limitations & Caveats

The project is described as incomplete: full image, audio, and video processing still need to be implemented. The provided code examples use heavily reduced model dimensions for demonstration purposes, and full-scale training is not documented.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 4
  • Issues (30d): 0

Star History

  • 9 stars in the last 90 days
