MiniGPT-4 by Vision-CAIR

Vision-language model for multi-task learning

created 2 years ago
25,727 stars

Top 1.6% on sourcepulse

View on GitHub
Project Summary

MiniGPT-4 and MiniGPT-v2 provide unified interfaces for vision-language multi-task learning, enabling users to interact with images using natural language. They target researchers and developers building multimodal AI applications, offering strong vision-language understanding and a single flexible framework covering tasks such as image description, visual question answering, and visual grounding.

How It Works

The projects leverage a BLIP-2-like architecture, connecting pre-trained vision encoders with large language models (LLMs) like Vicuna or Llama 2. This approach allows the LLM to interpret visual information, enabling complex reasoning and generation capabilities. The models are trained in stages, first aligning visual features with the LLM's embedding space and then fine-tuning for specific multi-task learning objectives.
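
To make the bridging step concrete, here is a minimal PyTorch sketch (not the repo's code): a frozen vision encoder's features are mapped into the LLM's token-embedding space by a single trainable linear projection and concatenated with the text embeddings, which is the piece trained during the first alignment stage. The dimensions are illustrative.

    import torch
    import torch.nn as nn

    class VisionToLLMBridge(nn.Module):
        """Sketch of the vision-to-LLM bridge: one trainable linear layer
        maps frozen vision features into the LLM embedding space."""
        def __init__(self, vision_dim=1408, llm_dim=4096):
            super().__init__()
            self.proj = nn.Linear(vision_dim, llm_dim)  # the stage-1 trainable piece

        def forward(self, vision_feats, text_embeds):
            # vision_feats: (batch, n_visual_tokens, vision_dim) from the frozen encoder
            # text_embeds:  (batch, n_text_tokens, llm_dim) from the frozen LLM embeddings
            visual_tokens = self.proj(vision_feats)
            # Prepend projected visual tokens so the LLM conditions on the image.
            return torch.cat([visual_tokens, text_embeds], dim=1)

    # Illustrative shapes: 32 visual query tokens, 16 text tokens.
    bridge = VisionToLLMBridge()
    fused = bridge(torch.randn(1, 32, 1408), torch.randn(1, 16, 4096))
    print(fused.shape)  # torch.Size([1, 48, 4096])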

Quick Start & Requirements

  • Installation: Clone the repository and create a conda environment using environment.yml.
  • Prerequisites: Requires pre-trained LLM weights (Llama 2 Chat 7B, Vicuna V0 13B/7B) and model checkpoints for MiniGPT-v2 or MiniGPT-4.
  • Resources: Demo requires ~11.5GB (7B LLM) to ~23GB (13B LLM) of GPU memory; the LLM can be loaded in 8-bit or 16-bit (see the sketch after this list).
  • Links: MiniGPT-v2 Project Page, MiniGPT-4 Project Page
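
As a rough illustration of the 8-bit option above, the sketch below loads a 7B chat LLM with Hugging Face transformers and bitsandbytes. This is an assumption for illustration, not the repo's own loading path (MiniGPT-4 drives this choice through its config files), and the model ID shown is the gated Llama 2 checkpoint on the Hub.

    # Illustrative sketch: load a 7B chat LLM in 8-bit to cut GPU memory;
    # not MiniGPT-4's actual loading code.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "meta-llama/Llama-2-7b-chat-hf"  # assumed weights; Hub access is gated
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit path
        # torch_dtype=torch.float16,  # 16-bit alternative: drop quantization_config
        device_map="auto",
    )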

Highlighted Details

  • Supports both MiniGPT-4 and the newer MiniGPT-v2.
  • Offers versions compatible with Vicuna and Llama 2 LLMs.
  • Demonstrates strong community adoption with several derivative projects.
  • Provides local demo execution scripts and links to online demos.

Maintenance & Community

Recent releases added evaluation and finetuning code for MiniGPT-v2 and a Llama 2 version of MiniGPT-4, though commit activity has since slowed (see Health Check below). Community support is available via Discord.

Licensing & Compatibility

The repository is licensed under the BSD 3-Clause License, and much of the underlying code it builds on is likewise BSD 3-Clause. The license is permissive and generally compatible with commercial use and closed-source linking.

Limitations & Caveats

The project relies on external pre-trained LLM weights, which may have their own licensing or access restrictions. Setting up the environment requires careful management of these dependencies and model checkpoints.
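
A small pre-flight check can surface missing weights before the demo fails at load time. Everything below is a hypothetical sketch with placeholder paths; the real filenames come from the repo's configs.

    # Hypothetical pre-flight check: confirm externally downloaded weights
    # exist before launching the demo. Paths are placeholders, not the
    # repo's actual layout.
    from pathlib import Path

    required = [
        Path("weights/llama-2-7b-chat"),         # LLM weights (separately licensed)
        Path("checkpoints/minigptv2_ckpt.pth"),  # MiniGPT-v2 checkpoint
    ]
    missing = [p for p in required if not p.exists()]
    if missing:
        raise FileNotFoundError(f"Missing required weights: {missing}")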

Health Check

  • Last commit: 11 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 2

Star History

189 stars in the last 90 days

Explore Similar Projects

Starred by Travis Fischer (Founder of Agentic), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 9 more.

LLaVA by haotian-liu

Multimodal assistant with GPT-4-level capabilities
Top 0.2% on sourcepulse · 23k stars · created 2 years ago · updated 11 months ago