MiniGPT-4-ZH by RiseInRose

Vision-language model enhances understanding using LLMs

Created 2 years ago

858 stars

Top 41.7% on SourcePulse

Project Summary

This repository provides a Chinese deployment and translation guide for MiniGPT-4, a vision-language model that enhances image understanding with large language models. It's targeted at Chinese-speaking users who want to deploy and utilize advanced multimodal AI capabilities, offering detailed instructions and troubleshooting for a smoother setup experience.

How It Works

MiniGPT-4 aligns a frozen visual encoder (from BLIP-2) with a frozen LLM (Vicuna) through a trainable projection layer. Training has two stages: a pre-training stage on millions of image-text pairs teaches the LLM to interpret visual features, and a computationally cheap fine-tuning stage on a curated, high-quality dataset of ~3,500 image-text pairs improves generation reliability and usability.
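
To make the role of the projection layer concrete, here is a minimal pure-Python sketch of a linear projection that maps visual-token features into an LLM's embedding size. It is an illustration only, not the project's implementation: the function name linear_project, the toy dimensions, and the random stand-in features are all assumptions, and the real models use far larger dimensions and a tensor library.

```python
import random

def linear_project(features, weight, bias):
    """Map each visual token (length d_in) to the LLM embedding size (length d_out)."""
    return [
        [sum(w * x for w, x in zip(row, token)) + b for row, b in zip(weight, bias)]
        for token in features
    ]

# Toy sizes chosen for illustration only (assumption, not the real model sizes).
NUM_VISUAL_TOKENS, D_VISION, D_LLM = 4, 8, 16

rng = random.Random(0)
# Stand-in for the frozen visual encoder's output: one vector per visual token.
visual_tokens = [
    [rng.gauss(0, 1) for _ in range(D_VISION)] for _ in range(NUM_VISUAL_TOKENS)
]
# In this sketch the projection weights are the only parameters that would be trained.
weight = [[rng.gauss(0, 0.02) for _ in range(D_VISION)] for _ in range(D_LLM)]
bias = [0.0] * D_LLM

# Each visual token now has the LLM's embedding size and could be fed to it
# alongside ordinary text-token embeddings.
llm_inputs = linear_project(visual_tokens, weight, bias)
print(len(llm_inputs), len(llm_inputs[0]))  # prints: 4 16
```

Because both the encoder and the LLM stay frozen, only the small projection needs gradients, which is consistent with the second training stage being described as computationally efficient.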

Quick Start & Requirements

Install: Clone the repository, create a Conda environment using environment.yml, and activate it (conda activate minigpt4).
Prerequisites: Vicuna weights (e.g., Vicuna-13B v1.1), which must be downloaded and converted to Hugging Face format. The conversion is memory-intensive (around 80GB of RAM for the 13B weights).
Demo: Run python demo.py --cfg-path eval_configs/minigpt4_eval.yaml --gpu-id 0.
GPU Memory: With the default 8-bit loading, Vicuna-13B needs ~23GB of GPU memory and Vicuna-7B needs ~11.5GB.
Resources: Detailed instructions for obtaining and converting LLaMA weights are provided, including links to Hugging Face and IPFS.
Windows: A guide for Windows deployment issues is linked.
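
As a rough planning aid, the sketch below picks the largest Vicuna variant that fits a given amount of free GPU memory, using the approximate figures quoted above, and assembles the demo command as an argument list. The helper names pick_vicuna and demo_command are hypothetical and not part of the repository; the script does not run anything or edit any config file, and switching the model variant in practice still requires following the repository's own instructions.

```python
import shlex

# Approximate GPU memory needs quoted above (default 8-bit loading), in GB.
GPU_MEMORY_GB = {"Vicuna-13B": 23.0, "Vicuna-7B": 11.5}

def pick_vicuna(free_gpu_gb):
    """Return the largest listed Vicuna variant that fits, or None if neither fits."""
    fitting = [name for name, need in GPU_MEMORY_GB.items() if need <= free_gpu_gb]
    return max(fitting, key=GPU_MEMORY_GB.get) if fitting else None

def demo_command(gpu_id=0, cfg_path="eval_configs/minigpt4_eval.yaml"):
    """Build the demo invocation as an argument list (nothing is executed here)."""
    return ["python", "demo.py", "--cfg-path", cfg_path, "--gpu-id", str(gpu_id)]

print(pick_vicuna(24))   # a 24GB card fits the 13B variant
print(pick_vicuna(16))   # a 16GB card only fits the 7B variant
print(shlex.join(demo_command()))
```

Returning None when neither variant fits makes the "not enough memory" case explicit instead of silently choosing a model that would fail to load.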

Highlighted Details

Offers a low-resource mode with Vicuna-7B requiring ~11.5GB GPU memory.
Gives detailed steps for preparing Vicuna weights: downloading the delta weights and the original LLaMA weights, then converting them.
Explains the two-stage training process: pre-training on large datasets and fine-tuning on a smaller, high-quality dataset.
Provides links to pre-trained MiniGPT-4 checkpoints aligned with Vicuna-7B and Vicuna-13B.

Maintenance & Community

The project is associated with Vision-CAIR at King Abdullah University of Science and Technology.
Links to a China-based discussion group on commercial AI applications are provided for community support.

Licensing & Compatibility

The repository is licensed under the BSD 3-Clause license.
Code is based on LAVIS, which is also licensed under BSD 3-Clause.

Limitations & Caveats

Preparing Vicuna weights involves downloading and converting large files, which can be complex and resource-intensive.
The conversion process for 13B weights requires approximately 80GB of RAM, potentially exceeding typical consumer hardware capabilities.
8-bit loading (model quantization) is enabled by default to reduce GPU memory use, which may reduce model accuracy.
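
The memory benefit of 8-bit loading can be estimated with simple arithmetic: weight memory is roughly the parameter count times the bytes per parameter. The sketch below assumes nominal parameter counts of 7 billion and 13 billion and counts the LLM weights only; the function name weight_memory_gb is hypothetical. The totals quoted earlier (~11.5GB and ~23GB) are higher, presumably because they also cover the visual encoder, activations, and other runtime overhead.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in GB (1 GB = 10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Nominal parameter counts (assumption: rounded from the model names).
for name, params in [("Vicuna-7B", 7e9), ("Vicuna-13B", 13e9)]:
    fp16 = weight_memory_gb(params, 16)
    int8 = weight_memory_gb(params, 8)
    print(f"{name}: 16-bit ~{fp16:.0f} GB, 8-bit ~{int8:.0f} GB")
```

Halving the bits per parameter halves the weight footprint, which is why 8-bit loading is the default even though it may cost some accuracy.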

Health Check

Last Commit

1 day ago

Responsiveness

Inactive

Star History

0 stars in the last 30 days