MGM by dvlab-research

Framework for multi-modality vision language models

created 1 year ago
3,301 stars

Top 15.0% on sourcepulse

View on GitHub
Project Summary

Mini-Gemini (MGM) is a framework for multi-modal vision-language models, enabling simultaneous image understanding, reasoning, and generation. It supports a range of LLMs from 2B to 34B parameters, including LLaMA3-based models, and is built upon the LLaVA architecture.

How It Works

MGM employs a dual vision encoder approach, producing both low-resolution and high-resolution visual embeddings. A "patch info mining" module then uses the low-resolution visual tokens as queries to extract fine-grained detail from the corresponding high-resolution regions. The Large Language Model (LLM) integrates the resulting visual tokens with text for both comprehension and generation.
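
Conceptually, patch info mining amounts to a cross-attention step in which each low-resolution token attends over the handful of high-resolution patches covering the same spatial region and folds the mined detail back in. The PyTorch sketch below illustrates only that idea; the module name, projection layout, and dimensions are illustrative assumptions, not code from the MGM repository.

```python
import torch
import torch.nn as nn


class PatchInfoMining(nn.Module):
    """Illustrative sketch (not MGM's code): each low-resolution visual token
    queries the M high-resolution patches covering the same image region, and
    the mined detail is added back to the token as a residual."""

    def __init__(self, dim_lr: int, dim_hr: int, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim_lr, dim)   # low-res tokens -> queries
        self.k_proj = nn.Linear(dim_hr, dim)   # high-res patches -> keys
        self.v_proj = nn.Linear(dim_hr, dim)   # high-res patches -> values
        self.out_proj = nn.Linear(dim, dim_lr)

    def forward(self, x_lr: torch.Tensor, x_hr: torch.Tensor) -> torch.Tensor:
        # x_lr: (B, N, dim_lr)     N low-resolution visual tokens
        # x_hr: (B, N, M, dim_hr)  M high-resolution patches per low-res region
        q = self.q_proj(x_lr).unsqueeze(2)                 # (B, N, 1, dim)
        k = self.k_proj(x_hr)                              # (B, N, M, dim)
        v = self.v_proj(x_hr)                              # (B, N, M, dim)
        attn = torch.softmax(
            q @ k.transpose(-1, -2) / k.shape[-1] ** 0.5,  # (B, N, 1, M)
            dim=-1,
        )
        mined = (attn @ v).squeeze(2)                      # (B, N, dim)
        return x_lr + self.out_proj(mined)                 # refined low-res tokens
```

For example, with a 336px CLIP-style encoder yielding N = 576 low-resolution tokens and M = 4 high-resolution patches per region, x_lr would be (B, 576, dim_lr) and x_hr (B, 576, 4, dim_hr); these numbers are purely illustrative.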

Quick Start & Requirements

  • Install: Clone the repo and install dependencies using pip install -e . within a Python 3.10 conda environment. Additional packages like ninja and flash-attn are recommended for training.
  • Prerequisites: Requires PyTorch, Transformers (>=4.38.0 for 2B models), and potentially PaddleOCR for enhanced OCR capabilities. Pretrained weights for LLMs and vision encoders need to be downloaded separately and organized according to the provided structure.
  • Resources: Training is demonstrated on 8 A100 GPUs (80GB VRAM). Inference supports multi-GPU, 4-bit, and 8-bit quantization (see the sketch after this list).
  • Links: Project Page, Demo, Paper
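
As one illustration of the 4-bit/8-bit inference mentioned above, loading a checkpoint might look like the sketch below. It assumes MGM keeps the LLaVA-style builder helper (a load_pretrained_model function under mgm.model.builder); the import path, model name, and checkpoint layout are assumptions to verify against the repository.

```python
# Hypothetical sketch of quantized inference loading. Assumes MGM mirrors LLaVA's
# builder API (mgm.model.builder.load_pretrained_model); the import path, model
# name, and checkpoint path are placeholders, not verified against the repo.
from mgm.model.builder import load_pretrained_model

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="./work_dirs/MGM-7B",  # assumed local layout for downloaded weights
    model_base=None,
    model_name="MGM-7B",
    load_4bit=True,                   # 4-bit quantization to reduce VRAM
)
```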

Highlighted Details

  • Supports LLaMA3-based models.
  • Supports image understanding at both 336px (standard) and 672px (high-resolution) input sizes.
  • Provides a CLI for inference and a Gradio Web UI for interactive use.
  • Includes comprehensive evaluation results on benchmarks like TextVQA, MMB, and MMMU.

Maintenance & Community

Recent updates added LLaMA3 support and a Hugging Face demo; see the Health Check section below for current activity (the last commit was about a year ago). Further community engagement details are not explicitly provided in the README.

Licensing & Compatibility

The data and checkpoints are licensed for research use only. They are also subject to the upstream licenses and terms of LLaVA, LLaMA, and Vicuna, and to OpenAI's terms of use for GPT-4-generated data. The dataset is licensed under CC BY-NC 4.0, restricting commercial use.

Limitations & Caveats

The licensing explicitly prohibits commercial use. The project is built on LLaVA, inheriting its dependencies and potential limitations. Training requires significant GPU resources.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
Star History
43 stars in the last 90 days

Explore Similar Projects

Starred by Travis Fischer (Founder of Agentic), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 9 more.

LLaVA by haotian-liu

Multimodal assistant with GPT-4 level capabilities

Top 0.2% on sourcepulse, 23k stars
created 2 years ago, updated 11 months ago