MiniGPT-4-ZH  by RiseInRose

Vision-language model enhances understanding using LLMs

Created 3 years ago
862 stars

Top 41.2% on SourcePulse

Project Summary

This repository provides a Chinese deployment and translation guide for MiniGPT-4, a vision-language model that enhances image understanding with large language models. It's targeted at Chinese-speaking users who want to deploy and utilize advanced multimodal AI capabilities, offering detailed instructions and troubleshooting for a smoother setup experience.

How It Works

MiniGPT-4 aligns a frozen visual encoder (from BLIP-2) with a frozen LLM (Vicuna) through a projection layer. Training runs in two stages: a pre-training phase on millions of image-text pairs teaches the LLM to interpret visual features, and a computationally light fine-tuning stage on a custom, high-quality dataset of ~3500 image-text pairs then improves generation reliability and usability.
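
As a rough illustration of the alignment idea, the sketch below applies one linear projection to a few visual feature vectors so that they match the embedding width a language model expects. The dimensions, random weights, and helper names (make_projection, project) are toy values chosen for this example; they are not MiniGPT-4's actual code or sizes.

```python
import random

def make_projection(in_dim, out_dim, seed=0):
    """Create toy weights and a zero bias for a linear layer (hypothetical helper)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias

def project(features, weights, bias):
    """Map each visual feature vector (length in_dim) to a vector of length out_dim."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]
        for vec in features
    ]

# Toy sizes: 4 visual tokens of width 8, projected to an embedding width of 16.
# With the encoder and the LLM both frozen, a layer like this is the part
# that would be updated during training.
visual_tokens = [[float(i + j) for j in range(8)] for i in range(4)]
weights, bias = make_projection(in_dim=8, out_dim=16)
llm_inputs = project(visual_tokens, weights, bias)
print(len(llm_inputs), len(llm_inputs[0]))
```

The projected vectors would then be fed to the language model alongside ordinary text token embeddings.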

Quick Start & Requirements

  • Install: Clone the repository, create a Conda environment using environment.yml, and activate it (conda activate minigpt4).
  • Prerequisites: Vicuna weights (e.g., Vicuna-13B v1.1) must be downloaded and converted to Hugging Face format. The conversion is memory-intensive (around 80GB of RAM for the 13B weights).
  • Demo: Run python demo.py --cfg-path eval_configs/minigpt4_eval.yaml --gpu-id 0.
  • GPU Memory: Vicuna-13B in 8-bit mode requires ~23GB GPU memory; Vicuna-7B requires ~11.5GB.
  • Resources: Detailed instructions for obtaining and converting LLaMA weights are provided, including links to Hugging Face and IPFS.
  • Windows: A guide for Windows deployment issues is linked.
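
The GPU memory figures and the demo command above can be combined into a small pre-flight helper. This is a minimal sketch, not part of the repository: pick_variant and demo_command are hypothetical names, and the thresholds are the approximate figures quoted in this guide.

```python
import shlex

# Approximate GPU memory needs quoted in this guide, in GB.
GPU_REQUIREMENTS_GB = {"Vicuna-13B (8-bit)": 23.0, "Vicuna-7B": 11.5}

def pick_variant(available_gb):
    """Return the largest listed variant that fits, or None if neither fits."""
    fitting = [(need, name) for name, need in GPU_REQUIREMENTS_GB.items()
               if need <= available_gb]
    return max(fitting)[1] if fitting else None

def demo_command(gpu_id=0, cfg_path="eval_configs/minigpt4_eval.yaml"):
    """Build the Quick Start demo invocation as an argument list."""
    return ["python", "demo.py", "--cfg-path", cfg_path, "--gpu-id", str(gpu_id)]

print(pick_variant(24))   # a 24GB card fits the 13B variant
print(pick_variant(16))   # a 16GB card only fits the 7B variant
print(pick_variant(8))    # neither variant fits
print(shlex.join(demo_command(gpu_id=0)))
```

The argument list can be passed to subprocess.run from the repository root once the Conda environment is active.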

Highlighted Details

  • Offers a low-resource mode with Vicuna-7B requiring ~11.5GB GPU memory.
  • Provides detailed steps for preparing Vicuna weights: downloading the delta weights and the original LLaMA weights, then converting them.
  • Explains the two-stage training process: pre-training on large datasets and fine-tuning on a smaller, high-quality dataset.
  • Provides links to pre-trained MiniGPT-4 checkpoints aligned with Vicuna-7B and Vicuna-13B.

Maintenance & Community

  • The project is associated with Vision-CAIR at King Abdullah University of Science and Technology.
  • Links to a China-based AI commercial application exchange group are provided for community support.

Licensing & Compatibility

  • The repository is licensed under the BSD 3-Clause license.
  • Code is based on LAVIS, also under BSD 3-Clause.

Limitations & Caveats

  • Preparing Vicuna weights involves downloading and converting large files, which can be complex and resource-intensive.
  • The conversion process for 13B weights requires approximately 80GB of RAM, potentially exceeding typical consumer hardware capabilities.
  • Model quantization (8-bit loading) is used by default to reduce GPU memory, which might affect model accuracy.
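
A back-of-the-envelope calculation helps put these memory figures in context. The sketch below estimates weight-only memory from a parameter count and a bit width. The nominal 7 and 13 billion parameter counts are assumptions, and the helper name weight_gb is hypothetical. The figures quoted in this guide are higher than these estimates, presumably because they also cover the visual encoder, activations, and runtime overhead.

```python
def weight_gb(params_billion, bits_per_param):
    """Weight-only memory in GB (10^9 bytes) for a given parameter count and precision."""
    return params_billion * bits_per_param / 8

# 8-bit loading stores roughly one byte per parameter.
print(weight_gb(13, 8))   # 13B weights at 8-bit
print(weight_gb(7, 8))    # 7B weights at 8-bit

# At 16-bit, one copy of the 13B weights takes twice as much.
print(weight_gb(13, 16))

# Plausibility check only: holding base weights and delta weights together
# at 16-bit already needs two copies, before any conversion overhead.
print(2 * weight_gb(13, 16))
```

This is not an account of how the conversion tool allocates memory; it only shows why a 13B conversion can exceed the RAM of a typical desktop machine.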

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 3 stars in the last 30 days
