MGM by dvlab-research

Framework for multi-modality vision language models

Created 1 year ago
3,317 stars

Top 14.6% on SourcePulse

View on GitHub
Project Summary

Mini-Gemini (MGM) is a framework for multi-modal vision-language models, enabling simultaneous image understanding, reasoning, and generation. It supports a range of LLMs from 2B to 34B parameters, including LLaMA3-based models, and is built upon the LLaVA architecture.

How It Works

MGM employs a dual vision encoder approach, producing both low-resolution and high-resolution visual embeddings. A "patch info mining" module lets the low-resolution visual tokens act as queries that extract fine-grained detail from the corresponding high-resolution regions. The Large Language Model (LLM) then integrates the text and image information for both comprehension and generation.
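The PyTorch sketch below illustrates the idea: each low-resolution token cross-attends only to the high-resolution patches covering its spatial region, and the mined detail is added back to the token. It is a conceptual outline under assumed shapes and layer names, not the repository's implementation.

    import torch
    import torch.nn as nn

    class PatchInfoMining(nn.Module):
        """Conceptual sketch of patch info mining (not the repo's code):
        each low-res visual token cross-attends to the high-res patches
        covering the same region and absorbs the mined detail."""

        def __init__(self, dim: int):
            super().__init__()
            self.q_proj = nn.Linear(dim, dim)       # projects low-res queries
            self.kv_proj = nn.Linear(dim, 2 * dim)  # projects high-res keys/values
            self.out_proj = nn.Linear(dim, dim)

        def forward(self, low_res: torch.Tensor, high_res: torch.Tensor) -> torch.Tensor:
            # low_res:  (B, N, D)    -- one query token per low-res patch
            # high_res: (B, N, M, D) -- M high-res patches per low-res region
            q = self.q_proj(low_res).unsqueeze(2)            # (B, N, 1, D)
            k, v = self.kv_proj(high_res).chunk(2, dim=-1)   # (B, N, M, D) each
            attn = (q @ k.transpose(-2, -1)) / k.shape[-1] ** 0.5
            attn = attn.softmax(dim=-1)                      # attend within the region
            mined = (attn @ v).squeeze(2)                    # (B, N, D)
            return low_res + self.out_proj(mined)            # enriched visual tokens

    # Example: 576 low-res tokens, each paired with a 2x2 block of high-res patches
    tokens = PatchInfoMining(1024)(torch.randn(1, 576, 1024), torch.randn(1, 576, 4, 1024))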

Quick Start & Requirements

  • Install: Clone the repo and install dependencies with pip install -e . inside a Python 3.10 conda environment; packages such as ninja and flash-attn are additionally recommended for training (a setup sketch follows this list).
  • Prerequisites: Requires PyTorch, Transformers (>=4.38.0 for the 2B models), and optionally PaddleOCR for enhanced OCR capabilities. Pretrained LLM and vision encoder weights must be downloaded separately and organized in the folder structure the repo specifies.
  • Resources: Training is demonstrated on 8 A100 GPUs (80GB VRAM). Inference supports multi-GPU, 4-bit, and 8-bit quantization.
  • Links: Project Page, Demo, Paper
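A minimal setup sketch based on the notes above; the repository URL and environment name are assumptions, and exact package versions may differ from the repo's requirements.

    # Isolated Python 3.10 environment (conda, as the Quick Start suggests)
    conda create -n mgm python=3.10 -y
    conda activate mgm

    # Clone and install in editable mode (repository URL assumed)
    git clone https://github.com/dvlab-research/MGM.git
    cd MGM
    pip install -e .

    # Optional extras recommended for training
    pip install ninja
    pip install flash-attn --no-build-isolation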

Highlighted Details

  • Supports LLaMA3-based models.
  • Offers both 336px and 672px high-resolution image understanding.
  • Provides a CLI for inference and a Gradio Web UI for interactive use (a sample CLI invocation follows this list).
  • Includes comprehensive evaluation results on benchmarks like TextVQA, MMB, and MMMU.
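A hedged example of CLI inference with 4-bit quantization: the module path, checkpoint directory, image path, and flags below follow LLaVA-style conventions and are assumptions, so check the repo's README for the exact interface.

    # Illustrative only -- entry point, paths, and flags are assumed
    python -m mgm.serve.cli \
        --model-path work_dirs/MGM/MGM-13B-HD \
        --image-file examples/demo.png \
        --load-4bit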

Maintenance & Community

The project has seen active maintenance, with recent updates adding LLaMA3 support and a Hugging Face demo; however, the Health Check below shows no activity in the last 30 days. Further community engagement details are not explicitly provided in the README.

Licensing & Compatibility

The data and checkpoints are licensed for research use only and are subject to the respective licenses and terms of use of LLaVA, LLaMA, Vicuna, and GPT-4. The dataset is licensed under CC BY-NC 4.0, which restricts commercial use.

Limitations & Caveats

The licensing explicitly prohibits commercial use. The project is built on LLaVA, inheriting its dependencies and potential limitations. Training requires significant GPU resources.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 13 stars in the last 30 days

Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Douwe Kiela (Cofounder of Contextual AI), and 1 more.

Explore Similar Projects

lens by ContextualAI

0.3%
353
Vision-language research paper using LLMs
Created 2 years ago
Updated 1 month ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI).

DeepSeek-VL2 by deepseek-ai

0.1%
5k
MoE vision-language model for multimodal understanding
Created 9 months ago
Updated 6 months ago