MiniGPT-4-ZH by RiseInRose

A vision-language model that enhances image understanding using large language models (LLMs)

created 2 years ago
860 stars

Top 42.6% on sourcepulse

Project Summary

This repository provides a Chinese deployment and translation guide for MiniGPT-4, a vision-language model that enhances image understanding with large language models. It targets Chinese-speaking users who want to deploy and use advanced multimodal AI capabilities, and it offers detailed instructions and troubleshooting for a smoother setup experience.

How It Works

MiniGPT-4 aligns a frozen visual encoder (the ViT and Q-Former from BLIP-2) with a frozen LLM (Vicuna) through a single trainable projection layer. Training runs in two stages: an initial pre-training phase on millions of image-text pairs teaches the LLM to interpret visual features, and a computationally efficient fine-tuning stage on a curated, high-quality dataset of ~3,500 image-text pairs then substantially improves generation reliability and usability.
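
A minimal PyTorch sketch of the alignment idea follows; it is not the repository's code, and the dimensions (768-d Q-Former outputs, 32 query tokens, a 5120-d LLM hidden size for a 13B-class model) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VisionToLLMProjection(nn.Module):
    """Sketch of the only trainable component in MiniGPT-4-style alignment:
    a linear map from Q-Former output space into the LLM embedding space."""

    def __init__(self, qformer_dim: int = 768, llm_hidden: int = 5120):
        super().__init__()
        self.proj = nn.Linear(qformer_dim, llm_hidden)

    def forward(self, qformer_tokens: torch.Tensor) -> torch.Tensor:
        # qformer_tokens: (batch, num_query_tokens, qformer_dim)
        # output:         (batch, num_query_tokens, llm_hidden),
        # consumed by the frozen LLM like ordinary text-token embeddings
        return self.proj(qformer_tokens)

# Example: 32 query tokens per image, projected for a 13B-class LLM.
tokens = torch.randn(1, 32, 768)
print(VisionToLLMProjection()(tokens).shape)  # torch.Size([1, 32, 5120])
```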

Quick Start & Requirements

  • Install: Clone the repository, create a Conda environment using environment.yml, and activate it (conda activate minigpt4).
  • Prerequisites: Requires Vicuna weights (e.g., Vicuna-13B v1.1), which must be downloaded as delta weights and merged with the original LLaMA weights into Hugging Face format. The merge is memory-intensive (roughly 80GB of system RAM for the 13B model); a loading sanity check is sketched just after this list.
  • Demo: Run python demo.py --cfg-path eval_configs/minigpt4_eval.yaml --gpu-id 0.
  • GPU Memory: Vicuna-13B in 8-bit mode requires ~23GB GPU memory; Vicuna-7B requires ~11.5GB.
  • Resources: Detailed instructions for obtaining and converting LLaMA weights are provided, including links to Hugging Face and IPFS.
  • Windows: A guide for Windows deployment issues is linked.
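
After the weights are merged, a quick way to confirm the conversion is to load the result with Hugging Face transformers. The sketch below is not the repository's demo code: the path is a placeholder, and the 8-bit option assumes the bitsandbytes and accelerate packages are installed.

```python
# Minimal sanity check (not the repo's demo code): verify that converted
# Vicuna weights load as a standard Hugging Face LLaMA-family model.
from transformers import AutoModelForCausalLM, AutoTokenizer

vicuna_path = "/path/to/vicuna-13b-v1.1"  # hypothetical output of the delta merge

tokenizer = AutoTokenizer.from_pretrained(vicuna_path, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    vicuna_path,
    device_map="auto",   # requires accelerate; spreads layers across devices
    load_in_8bit=True,   # requires bitsandbytes; mirrors the demo's 8-bit default
)
print(model.config.model_type)  # expect "llama" for a correct conversion
```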

Highlighted Details

  • Offers a low-resource mode with Vicuna-7B requiring ~11.5GB GPU memory.
  • Documents detailed steps for preparing Vicuna weights: downloading the delta weights and the original LLaMA weights, then merging and converting them.
  • Explains the two-stage training process: pre-training on large datasets and fine-tuning on a smaller, high-quality dataset.
  • Provides links to pre-trained MiniGPT-4 checkpoints aligned with Vicuna-7B and Vicuna-13B.

Maintenance & Community

  • The project is associated with Vision-CAIR at King Abdullah University of Science and Technology.
  • Links to a China-based AI commercial application exchange group are provided for community support.

Licensing & Compatibility

  • The repository is licensed under the BSD 3-Clause license.
  • The code is based on LAVIS, which is also under the BSD 3-Clause license.

Limitations & Caveats

  • Preparing Vicuna weights involves downloading and converting large files, which can be complex and resource-intensive.
  • The conversion process for the 13B weights requires approximately 80GB of system RAM (see the arithmetic sketch below), potentially exceeding typical consumer hardware capabilities.
  • Model quantization (8-bit loading) is used by default to reduce GPU memory, which might affect model accuracy.
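
For context, here is rough arithmetic behind the ~80GB figure. It assumes (an assumption, not a documented breakdown) that the delta-merge step holds about three full fp16 copies of the 13B model in memory at once: the base LLaMA weights, the delta, and the merged result.

```python
# Rough arithmetic behind the ~80GB RAM figure (a plausible accounting,
# not a measured breakdown of the actual merge process).
params = 13e9                    # 13B parameters
gb_per_copy = params * 2 / 1e9   # fp16 = 2 bytes per parameter -> ~26 GB

print(f"one fp16 copy of 13B: ~{gb_per_copy:.0f} GB")
print(f"three copies (base + delta + merged): ~{3 * gb_per_copy:.0f} GB")  # ~78 GB
```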

Health Check

  • Last commit: 23 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 90 days
