MGM by dvlab-research

Framework for multi-modality vision language models

Created 1 year ago
3,317 stars

Top 14.6% on SourcePulse

View on GitHub
Project Summary

Mini-Gemini (MGM) is a framework for multi-modal vision-language models, enabling simultaneous image understanding, reasoning, and generation. It supports a range of LLMs from 2B to 34B parameters, including LLaMA3-based models, and is built upon the LLaVA architecture.

How It Works

MGM employs a dual vision encoder approach, producing both low-resolution and high-resolution visual embeddings. A "patch info mining" module lets the low-resolution visual tokens act as queries that extract fine-grained detail from the corresponding high-resolution regions. The Large Language Model (LLM) then integrates the text and image information for both comprehension and generation.
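The PyTorch sketch below illustrates the idea: each low-resolution token cross-attends only to the high-resolution patches covering its spatial region, and the mined detail is added back to the token. It is a conceptual outline under assumed shapes and layer names, not the repository's implementation.

    import torch
    import torch.nn as nn

    class PatchInfoMining(nn.Module):
        """Conceptual sketch of patch info mining (not the repo's code):
        each low-res visual token cross-attends to the high-res patches
        covering the same region and absorbs the mined detail."""

        def __init__(self, dim: int):
            super().__init__()
            self.q_proj = nn.Linear(dim, dim)       # projects low-res queries
            self.kv_proj = nn.Linear(dim, 2 * dim)  # projects high-res keys/values
            self.out_proj = nn.Linear(dim, dim)

        def forward(self, low_res: torch.Tensor, high_res: torch.Tensor) -> torch.Tensor:
            # low_res:  (B, N, D)    -- one query token per low-res patch
            # high_res: (B, N, M, D) -- M high-res patches per low-res region
            q = self.q_proj(low_res).unsqueeze(2)            # (B, N, 1, D)
            k, v = self.kv_proj(high_res).chunk(2, dim=-1)   # (B, N, M, D) each
            attn = (q @ k.transpose(-2, -1)) / k.shape[-1] ** 0.5
            attn = attn.softmax(dim=-1)                      # attend within the region
            mined = (attn @ v).squeeze(2)                    # (B, N, D)
            return low_res + self.out_proj(mined)            # enriched visual tokens

    # Example: 576 low-res tokens, each paired with a 2x2 block of high-res patches
    tokens = PatchInfoMining(1024)(torch.randn(1, 576, 1024), torch.randn(1, 576, 4, 1024))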

Quick Start & Requirements

  • Install: Clone the repo and install dependencies with pip install -e . inside a Python 3.10 conda environment; packages such as ninja and flash-attn are additionally recommended for training (a setup sketch follows this list).
  • Prerequisites: Requires PyTorch, Transformers (>=4.38.0 for the 2B models), and optionally PaddleOCR for enhanced OCR capabilities. Pretrained LLM and vision encoder weights must be downloaded separately and organized in the folder structure the repo specifies.
  • Resources: Training is demonstrated on 8 A100 GPUs (80GB VRAM). Inference supports multi-GPU, 4-bit, and 8-bit quantization.
  • Links: Project Page, Demo, Paper
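A minimal setup sketch based on the notes above; the repository URL and environment name are assumptions, and exact package versions may differ from the repo's requirements.

    # Isolated Python 3.10 environment (conda, as the Quick Start suggests)
    conda create -n mgm python=3.10 -y
    conda activate mgm

    # Clone and install in editable mode (repository URL assumed)
    git clone https://github.com/dvlab-research/MGM.git
    cd MGM
    pip install -e .

    # Optional extras recommended for training
    pip install ninja
    pip install flash-attn --no-build-isolation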

Highlighted Details

  • Supports LLaMA3-based models.
  • Offers both 336px and 672px high-resolution image understanding.
  • Provides a CLI for inference and a Gradio Web UI for interactive use (a sample CLI invocation follows this list).
  • Includes comprehensive evaluation results on benchmarks like TextVQA, MMB, and MMMU.
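A hedged example of CLI inference with 4-bit quantization: the module path, checkpoint directory, image path, and flags below follow LLaVA-style conventions and are assumptions, so check the repo's README for the exact interface.

    # Illustrative only -- entry point, paths, and flags are assumed
    python -m mgm.serve.cli \
        --model-path work_dirs/MGM/MGM-13B-HD \
        --image-file examples/demo.png \
        --load-4bit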

Maintenance & Community

The project has seen active maintenance, with recent updates adding LLaMA3 support and a Hugging Face demo; however, the Health Check below shows no activity in the last 30 days. Further community engagement details are not explicitly provided in the README.

Licensing & Compatibility

The data and checkpoints are licensed for research use only and are subject to the respective licenses and terms of use of LLaVA, LLaMA, Vicuna, and GPT-4. The dataset is licensed under CC BY-NC 4.0, which restricts commercial use.

Limitations & Caveats

The licensing explicitly prohibits commercial use. The project is built on LLaVA, inheriting its dependencies and potential limitations. Training requires significant GPU resources.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 13 stars in the last 30 days

Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Douwe Kiela (Cofounder of Contextual AI), and 1 more.

Explore Similar Projects

lens by ContextualAI

0.3%
353
Vision-language research paper using LLMs
Created 2 years ago
Updated 1 month ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI).

DeepSeek-VL2 by deepseek-ai

0.1%
5k
MoE vision-language model for multimodal understanding
Created 9 months ago
Updated 6 months ago