Gemini by kyegomez

Open-source implementation of Gemini, Google's multimodal model

Created 2 years ago
459 stars

Top 65.9% on SourcePulse

View on GitHub
Project Summary

This repository provides an open-source implementation of Google's Gemini model, aiming to replicate its multimodal capabilities for processing text, images, and audio. It's targeted at researchers and developers interested in building and experimenting with advanced multimodal AI architectures, offering a foundation for creating models that can understand and generate content across different data types.

How It Works

The core of the implementation is a transformer that ingests tokenized inputs from all modalities as a single interleaved sequence. Image embeddings are fed directly into the transformer rather than through a separate vision-transformer encoder, following Fuyu's architecture but extended to additional modalities such as audio. Special tokens mark where the modality switches within the input sequence, and the model supports optimized components such as Flash Attention and query-key (QK) normalization for improved performance.
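
To make the interleaved-modality design concrete, below is a minimal forward-pass sketch in the style of the repository's README example. The import path (`gemini_torch`), the constructor keywords (`num_tokens`, `max_seq_len`, `dim`, `depth`, `attn_flash`, `qk_norm`), and the `(text, img)` call signature are assumptions drawn from that example and may differ between versions; the dimensions are deliberately tiny for demonstration.

```python
import torch
from gemini_torch import Gemini  # import path assumed from the README example

# Tiny configuration for demonstration only; real runs use much larger dimensions.
model = Gemini(
    num_tokens=50432,   # vocabulary size (text plus modality special tokens)
    max_seq_len=1024,   # maximum interleaved sequence length
    dim=512,            # model width
    depth=6,            # number of transformer layers
    dim_head=64,
    heads=8,
    attn_flash=True,    # Flash Attention
    qk_norm=True,       # Query-Key normalization
)

# Text token ids and a raw image; image patches are embedded and fed
# straight into the same transformer, with no separate ViT encoder.
text = torch.randint(0, 50432, (1, 1024))
img = torch.randn(1, 3, 256, 256)

logits = model(text, img)  # call signature assumed from the README example
print(logits.shape)        # expected roughly (1, seq_len, num_tokens)
```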

Quick Start & Requirements
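
Installation is expected to be a single `pip install gemini-torch` (the PyPI package name is an assumption based on the repository's Python module) on top of a recent PyTorch; the forward-pass sketch under How It Works above shows minimal usage.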

Highlighted Details

  • Implements a transformer that natively handles interleaved text, image, and audio inputs.
  • Utilizes optimized components like Flash Attention, RoPE, ALiBi, and QK Norm.
  • Includes a LongGemini variant with Ring Attention for extended sequence lengths.
  • Features a MultimodalSentencePieceTokenizer compatible with LLaMA's tokenizer (see the sketch after this list).
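
As a rough illustration of the tokenizer bullet above, the sketch below encodes text with a modality tag so that the special tokens marking modality switches are inserted automatically. The module path, the `tokenizer_name` argument, and the `modality` keyword are assumptions based on the README's example and may not match the current code exactly.

```python
from gemini_torch.tokenizer import MultimodalSentencePieceTokenizer  # path assumed

# LLaMA-compatible SentencePiece tokenizer; model name assumed for illustration.
tokenizer = MultimodalSentencePieceTokenizer(
    tokenizer_name="hf-internal-testing/llama-tokenizer"
)

# Encoding with a modality tag wraps the text in modality special tokens,
# so interleaved sequences can mark where audio or image content begins and ends.
audio_ids = tokenizer.encode("a short description of the audio clip", modality="audio")
image_ids = tokenizer.encode("a short description of the image", modality="image")
print(audio_ids[:8], image_ids[:8])
```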

Maintenance & Community

  • Community support is available via the Agora Discord channel.
  • A project board is linked for tracking development tasks.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README.

Limitations & Caveats

The project is described as needing further implementation for full image, audio, and video processing. The provided code examples use significantly reduced model dimensions for demonstration purposes, and full-scale training is not detailed.

Health Check

  • Last Commit: 3 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 9
  • Issues (30d): 0

Star History

2 stars in the last 30 days

Explore Similar Projects

Starred by Jiayi Pan (Author of SWE-Gym; MTS at xAI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 1 more.

METER by zdou0830

0%
373 stars
Multimodal framework for vision-and-language transformer research
Created 3 years ago
Updated 2 years ago
Starred by Alex Yu (Research Scientist at OpenAI; Former Cofounder of Luma AI) and Phil Wang (Prolific Research Paper Implementer).

Cosmos-Tokenizer by NVIDIA

0.1%
2k stars
Suite of neural tokenizers for image and video processing
Created 10 months ago
Updated 7 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI).

NExT-GPT by NExT-GPT

0.1%
4k stars
Any-to-any multimodal LLM research paper
Created 2 years ago
Updated 4 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI).

DeepSeek-VL2 by deepseek-ai

0.1%
5k stars
MoE vision-language model for multimodal understanding
Created 9 months ago
Updated 6 months ago