MiniGPT-4-ZH by RiseInRose

Vision-language model that enhances image understanding using LLMs

Created 2 years ago
859 stars

Top 41.7% on SourcePulse

View on GitHub
Project Summary

This repository provides a Chinese deployment and translation guide for MiniGPT-4, a vision-language model that enhances image understanding with large language models. It's targeted at Chinese-speaking users who want to deploy and utilize advanced multimodal AI capabilities, offering detailed instructions and troubleshooting for a smoother setup experience.

How It Works

MiniGPT-4 aligns a frozen visual encoder (from BLIP-2) with a frozen LLM (Vicuna) using a projection layer. It's trained in two stages: an initial pre-training phase on millions of image-text pairs to enable the LLM to understand images, followed by a computationally efficient fine-tuning stage using a custom, high-quality dataset of ~3500 image-text pairs to significantly improve generation reliability and usability.
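
The alignment idea fits in a few lines of PyTorch. The sketch below is only an illustration of the pattern, not the repository's actual code; the dimensions (768 for the BLIP-2 Q-Former output, 5120 for Vicuna-13B's hidden size) are assumed values.

```python
# Minimal sketch of the alignment idea, not the repository's actual code.
# A single trainable linear layer maps frozen visual features into the
# frozen LLM's token-embedding space; the encoder and the LLM stay frozen.
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    def __init__(self, vision_dim: int = 768, llm_dim: int = 5120):
        # 768 and 5120 are assumed values (Q-Former width, Vicuna-13B hidden
        # size); check the project configs for the real numbers.
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_query_tokens, vision_dim) produced by the
        # frozen visual encoder; the output is fed to Vicuna as a soft prompt.
        return self.proj(visual_tokens)

# Only the projection layer receives gradients in both training stages.
projector = VisionToLLMProjector()
features = torch.randn(2, 32, 768)   # e.g. 32 query tokens per image
soft_prompt = projector(features)
print(soft_prompt.shape)             # torch.Size([2, 32, 5120])
```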

Quick Start & Requirements

  • Install: Clone the repository, create a Conda environment using environment.yml, and activate it (conda activate minigpt4).
  • Prerequisites: Requires Vicuna weights (e.g., Vicuna-13B v1.1), which must be downloaded and converted to Hugging Face format. The 13B conversion can require around 80GB of RAM.
  • Demo: Run python demo.py --cfg-path eval_configs/minigpt4_eval.yaml --gpu-id 0.
  • GPU Memory: Vicuna-13B in 8-bit mode requires ~23GB of GPU memory; Vicuna-7B requires ~11.5GB (a loading sketch follows this list).
  • Resources: Detailed instructions for obtaining and converting LLaMA weights are provided, including links to Hugging Face and IPFS.
  • Windows: A guide for Windows deployment issues is linked.
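
As a rough illustration of the memory figures above, the following sketch loads converted Vicuna weights in 8-bit via Hugging Face transformers and bitsandbytes. It is assumed API usage, not the demo script's exact loading code, and the path is hypothetical.

```python
# Illustrative sketch only: loading converted Vicuna weights in 8-bit with
# transformers + bitsandbytes. The actual demo wires the LLM into MiniGPT-4's
# own classes; the path below is hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

VICUNA_PATH = "/path/to/vicuna-13b-v1.1-hf"   # hypothetical converted weights

tokenizer = AutoTokenizer.from_pretrained(VICUNA_PATH, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    VICUNA_PATH,
    load_in_8bit=True,   # 8-bit quantization: ~23GB for 13B, ~11.5GB for 7B
    device_map="auto",   # place layers on the available GPU(s)
)
```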

Highlighted Details

  • Offers a low-resource mode with Vicuna-7B requiring ~11.5GB GPU memory.
  • Detailed steps for preparing Vicuna weights: downloading the delta weights and the original LLaMA weights, then converting them (see the conceptual sketch after this list).
  • Explains the two-stage training process: pre-training on large datasets and fine-tuning on a smaller, high-quality dataset.
  • Provides links to pre-trained MiniGPT-4 checkpoints aligned with Vicuna-7B and Vicuna-13B.
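
Conceptually, the weight preparation amounts to adding the Vicuna delta tensors to the original LLaMA tensors and saving the result in Hugging Face format. The sketch below is only a conceptual illustration with hypothetical paths, not the conversion tool the guide actually points to; follow the guide's commands for a real conversion.

```python
# Conceptual illustration only (paths hypothetical): Vicuna "delta" weights
# are applied to the original LLaMA weights tensor by tensor. Both full
# models are resident at once, which is why the 13B conversion needs tens
# of GB of RAM. Use the guide's conversion tool for real conversions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_PATH = "/path/to/llama-13b-hf"            # hypothetical
DELTA_PATH = "/path/to/vicuna-13b-delta-v1.1"  # hypothetical
OUT_PATH = "/path/to/vicuna-13b-v1.1-hf"       # hypothetical

base = AutoModelForCausalLM.from_pretrained(
    BASE_PATH, torch_dtype=torch.float16, low_cpu_mem_usage=True)
delta = AutoModelForCausalLM.from_pretrained(
    DELTA_PATH, torch_dtype=torch.float16, low_cpu_mem_usage=True)

delta_state = delta.state_dict()
for name, param in base.state_dict().items():
    # Assumes matching tensor names and shapes between the two checkpoints.
    param.data += delta_state[name]

base.save_pretrained(OUT_PATH)
AutoTokenizer.from_pretrained(DELTA_PATH, use_fast=False).save_pretrained(OUT_PATH)
```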

Maintenance & Community

  • The upstream MiniGPT-4 project is from Vision-CAIR at King Abdullah University of Science and Technology (KAUST); this repository is a Chinese translation and deployment guide maintained by RiseInRose.
  • Links to a China-based AI commercial application exchange group are provided for community support.

Licensing & Compatibility

  • The repository is licensed under the BSD 3-Clause license.
  • The code builds on LAVIS, which is also licensed under BSD 3-Clause.

Limitations & Caveats

  • Preparing Vicuna weights involves downloading and converting large files, which can be complex and resource-intensive.
  • The conversion process for 13B weights requires approximately 80GB of RAM, potentially exceeding typical consumer hardware capabilities.
  • Model quantization (8-bit loading) is used by default to reduce GPU memory, which might affect model accuracy.
Health Check

  • Last Commit: 14 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

lens by ContextualAI
  Vision-language research paper using LLMs
  353 stars · 0% · Created 2 years ago · Updated 2 months ago
  Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Douwe Kiela (Cofounder of Contextual AI), and 1 more.

open_flamingo by mlfoundations
  Open-source framework for training large multimodal models
  4k stars · 0.1% · Created 3 years ago · Updated 1 year ago
  Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Wing Lian (Founder of Axolotl AI), and 10 more.