Framework for multi-modal vision-language models
Mini-Gemini (MGM) is a framework for multi-modal vision-language models, enabling simultaneous image understanding, reasoning, and generation. It supports a range of LLMs from 2B to 34B parameters, including LLaMA3-based models, and is built upon the LLaVA architecture.
How It Works
MGM employs a dual vision encoder design that produces both low-resolution and high-resolution visual embeddings. A "patch info mining" module then uses each low-resolution visual token as a query to retrieve fine-grained cues from its corresponding high-resolution region. The Large Language Model (LLM) integrates the resulting visual tokens with text for comprehension and generation.
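Patch info mining can be pictured as a small cross-attention step in which each low-resolution token attends only to the high-resolution patches covering the same spatial area. The following is a minimal PyTorch sketch of that idea, assuming each of the N low-resolution tokens maps to an M-patch high-resolution window; the class name, layer names, and shapes are illustrative and not the repository's actual implementation.

import torch
import torch.nn as nn

class PatchInfoMining(nn.Module):
    # Illustrative cross-attention sketch: low-res visual tokens query their
    # corresponding high-res patch windows (not MGM's exact code).
    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)  # projects low-res queries
        self.k_proj = nn.Linear(dim, dim)  # projects high-res keys
        self.v_proj = nn.Linear(dim, dim)  # projects high-res values
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, low_res: torch.Tensor, high_res: torch.Tensor) -> torch.Tensor:
        # low_res:  (B, N, D)    one token per low-resolution patch
        # high_res: (B, N, M, D) M high-resolution sub-patches per low-res token
        q = self.q_proj(low_res).unsqueeze(2)          # (B, N, 1, D)
        k = self.k_proj(high_res)                      # (B, N, M, D)
        v = self.v_proj(high_res)                      # (B, N, M, D)
        attn = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        mined = (attn @ v).squeeze(2)                  # (B, N, D) detail pulled from high-res
        return low_res + self.mlp(mined)               # enriched visual tokens fed to the LLM

# Usage: 576 low-res tokens, each backed by a 4-patch high-res window, 1024-dim features.
tokens = PatchInfoMining(1024)(torch.randn(1, 576, 1024), torch.randn(1, 576, 4, 1024))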
Quick Start & Requirements
Install the project with
pip install -e .
inside a Python 3.10 conda environment. Additional packages such as ninja and flash-attn are recommended for training.
Highlighted Details
Maintenance & Community
The project is actively maintained, with recent updates adding LLaMA3 support and a Hugging Face demo. Further community engagement details are not explicitly provided in the README.
Licensing & Compatibility
The data and checkpoints are licensed for research use only. They are subject to the licenses of LLaVA, LLaMA, Vicuna, and GPT-4. The dataset is licensed under CC BY-NC 4.0, restricting commercial use.
Limitations & Caveats
The licensing explicitly prohibits commercial use. The project is built on LLaVA, inheriting its dependencies and potential limitations. Training requires significant GPU resources.