MiniGPT-4 by Vision-CAIR

Vision-language model for multi-task learning

created 2 years ago
25,727 stars

Top 1.6% on sourcepulse

View on GitHub
Project Summary

MiniGPT-4 and MiniGPT-v2 provide unified interfaces for vision-language multi-task learning, enabling users to interact with images using natural language. They target researchers and developers building multimodal AI applications, offering strong vision-language understanding and a single flexible framework covering tasks such as image description, visual question answering, and visual grounding.

How It Works

The projects leverage a BLIP-2-like architecture, connecting pre-trained vision encoders with large language models (LLMs) like Vicuna or Llama 2. This approach allows the LLM to interpret visual information, enabling complex reasoning and generation capabilities. The models are trained in stages, first aligning visual features with the LLM's embedding space and then fine-tuning for specific multi-task learning objectives.
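
To make the bridging step concrete, here is a minimal PyTorch sketch (not the repo's code): a frozen vision encoder's features are mapped into the LLM's token-embedding space by a single trainable linear projection and concatenated with the text embeddings, which is the piece trained during the first alignment stage. The dimensions are illustrative.

    import torch
    import torch.nn as nn

    class VisionToLLMBridge(nn.Module):
        """Sketch of the vision-to-LLM bridge: one trainable linear layer
        maps frozen vision features into the LLM embedding space."""
        def __init__(self, vision_dim=1408, llm_dim=4096):
            super().__init__()
            self.proj = nn.Linear(vision_dim, llm_dim)  # the stage-1 trainable piece

        def forward(self, vision_feats, text_embeds):
            # vision_feats: (batch, n_visual_tokens, vision_dim) from the frozen encoder
            # text_embeds:  (batch, n_text_tokens, llm_dim) from the frozen LLM embeddings
            visual_tokens = self.proj(vision_feats)
            # Prepend projected visual tokens so the LLM conditions on the image.
            return torch.cat([visual_tokens, text_embeds], dim=1)

    # Illustrative shapes: 32 visual query tokens, 16 text tokens.
    bridge = VisionToLLMBridge()
    fused = bridge(torch.randn(1, 32, 1408), torch.randn(1, 16, 4096))
    print(fused.shape)  # torch.Size([1, 48, 4096])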

Quick Start & Requirements

  • Installation: Clone the repository and create a conda environment using environment.yml.
  • Prerequisites: Requires pre-trained LLM weights (Llama 2 Chat 7B, Vicuna V0 13B/7B) and model checkpoints for MiniGPT-v2 or MiniGPT-4.
  • Resources: Demo requires ~11.5GB (7B LLM) to ~23GB (13B LLM) of GPU memory; the LLM can be loaded in 8-bit or 16-bit (see the sketch after this list).
  • Links: MiniGPT-v2 Project Page, MiniGPT-4 Project Page
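
As a rough illustration of the 8-bit option above, the sketch below loads a 7B chat LLM with Hugging Face transformers and bitsandbytes. This is an assumption for illustration, not the repo's own loading path (MiniGPT-4 drives this choice through its config files), and the model ID shown is the gated Llama 2 checkpoint on the Hub.

    # Illustrative sketch: load a 7B chat LLM in 8-bit to cut GPU memory;
    # not MiniGPT-4's actual loading code.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "meta-llama/Llama-2-7b-chat-hf"  # assumed weights; Hub access is gated
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit path
        # torch_dtype=torch.float16,  # 16-bit alternative: drop quantization_config
        device_map="auto",
    )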

Highlighted Details

  • Supports both MiniGPT-4 and the newer MiniGPT-v2.
  • Offers versions compatible with Vicuna and Llama 2 LLMs.
  • Demonstrates strong community adoption with several derivative projects.
  • Provides local demo execution scripts and links to online demos.

Maintenance & Community

Recent releases added evaluation and finetuning code for MiniGPT-v2 and a Llama 2 version of MiniGPT-4, though commit activity has since slowed (see Health Check below). Community support is available via Discord.

Licensing & Compatibility

The repository is licensed under the BSD 3-Clause License, and much of the underlying code it builds on is likewise BSD 3-Clause. The license is permissive and generally compatible with commercial use and closed-source linking.

Limitations & Caveats

The project relies on external pre-trained LLM weights, which may have their own licensing or access restrictions. Setting up the environment requires careful management of these dependencies and model checkpoints.
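
A small pre-flight check can surface missing weights before the demo fails at load time. Everything below is a hypothetical sketch with placeholder paths; the real filenames come from the repo's configs.

    # Hypothetical pre-flight check: confirm externally downloaded weights
    # exist before launching the demo. Paths are placeholders, not the
    # repo's actual layout.
    from pathlib import Path

    required = [
        Path("weights/llama-2-7b-chat"),         # LLM weights (separately licensed)
        Path("checkpoints/minigptv2_ckpt.pth"),  # MiniGPT-v2 checkpoint
    ]
    missing = [p for p in required if not p.exists()]
    if missing:
        raise FileNotFoundError(f"Missing required weights: {missing}")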

Health Check

  • Last commit: 11 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 2

Star History

189 stars in the last 90 days

Explore Similar Projects

Starred by Travis Fischer (Founder of Agentic), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 9 more.

LLaVA by haotian-liu

Multimodal assistant with GPT-4-level capabilities
Top 0.2% on sourcepulse · 23k stars · created 2 years ago · updated 11 months ago