This repository provides a Chinese deployment and translation guide for MiniGPT-4, a vision-language model that enhances image understanding with large language models. It targets Chinese-speaking users who want to deploy and run MiniGPT-4, and offers detailed setup instructions and troubleshooting notes.
How It Works
MiniGPT-4 aligns a frozen visual encoder (from BLIP-2) with a frozen LLM (Vicuna) using a single trainable linear projection layer. It is trained in two stages: a pre-training stage on roughly 5 million image-text pairs, which teaches the LLM to interpret visual features, followed by a computationally cheap fine-tuning stage on a custom, high-quality dataset of ~3500 image-text pairs, which significantly improves generation reliability and usability.
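The alignment step described above can be sketched in a few lines of plain Python. This is a minimal illustration of what a linear projection between two frozen models does, not the repository's implementation: the function names are hypothetical, and the dimensions 768 (visual token width) and 5120 (LLM hidden size) used in the parameter count are illustrative assumptions, not values taken from this guide.

```python
from typing import List

Matrix = List[List[float]]


def linear_project(tokens: Matrix, weight: Matrix, bias: List[float]) -> Matrix:
    """Apply y = x @ W^T + b to every visual token (one row per token)."""
    return [
        [sum(x * w for x, w in zip(token, w_row)) + b
         for w_row, b in zip(weight, bias)]
        for token in tokens
    ]


def projection_param_count(in_dim: int, out_dim: int) -> int:
    """Trainable parameters of one linear layer: weight matrix plus bias."""
    return in_dim * out_dim + out_dim


# Toy example: one 2-dimensional visual token mapped into a 3-dimensional LLM space.
toy_tokens = [[1.0, 2.0]]
toy_weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # shape (out_dim, in_dim)
toy_bias = [0.0, 0.0, 1.0]
projected = linear_project(toy_tokens, toy_weight, toy_bias)

# Assumed dimensions (not from this guide): 768-wide visual tokens, 5120-wide LLM.
assumed_params = projection_param_count(768, 5120)
```

Because both the visual encoder and the LLM stay frozen, only the weight matrix and bias of this one layer would be updated during training, which is why the fine-tuning stage can be so cheap.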
Quick Start & Requirements
- Install: Clone the repository, create a Conda environment using `environment.yml`, and activate it (`conda activate minigpt4`).
- Prerequisites: Requires Vicuna weights (e.g., Vicuna-13B v1.1), which must be downloaded and converted to Hugging Face format. The conversion is memory-intensive (around 80GB of RAM for the 13B weights).
- Demo: Run `python demo.py --cfg-path eval_configs/minigpt4_eval.yaml --gpu-id 0`.
- GPU Memory: In 8-bit mode, Vicuna-13B requires ~23GB of GPU memory and Vicuna-7B requires ~11.5GB.
- Resources: Detailed instructions for obtaining and converting LLaMA weights are provided, including links to Hugging Face and IPFS.
- Windows: A guide for Windows deployment issues is linked.
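As a rough aid for the steps above, the following sketch picks a Vicuna variant from the memory figures quoted in this guide and assembles the demo command. The helper names (`pick_variant`, `demo_command`) and the variant labels are hypothetical; only the `--cfg-path` and `--gpu-id` flags and the config path come from the guide itself.

```python
from typing import List, Optional

# Approximate 8-bit GPU memory needs quoted in this guide, in GB.
REQUIRED_GB = {"vicuna-13b": 23.0, "vicuna-7b": 11.5}


def pick_variant(free_gb: float) -> Optional[str]:
    """Return the largest variant whose quoted requirement fits, else None."""
    by_size = sorted(REQUIRED_GB.items(), key=lambda item: item[1], reverse=True)
    for name, needed_gb in by_size:
        if free_gb >= needed_gb:
            return name
    return None


def demo_command(gpu_id: int = 0,
                 cfg_path: str = "eval_configs/minigpt4_eval.yaml") -> List[str]:
    """Build the demo invocation as an argument list (suitable for subprocess.run)."""
    return ["python", "demo.py", "--cfg-path", cfg_path, "--gpu-id", str(gpu_id)]


if __name__ == "__main__":
    print(pick_variant(16.0))
    print(" ".join(demo_command(gpu_id=0)))
```

The thresholds are only as accurate as the quoted figures, so it is worth leaving some headroom rather than treating them as hard limits.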
Highlighted Details
- Offers a low-resource mode with Vicuna-7B requiring ~11.5GB GPU memory.
- Detailed steps for preparing Vicuna weights, including downloading delta weights and original LLaMA weights, and converting them.
- Explains the two-stage training process: pre-training on large datasets and fine-tuning on a smaller, high-quality dataset.
- Provides links to pre-trained MiniGPT-4 checkpoints aligned with Vicuna-7B and Vicuna-13B.
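The weight-preparation step mentioned above combines the original LLaMA weights with the Vicuna delta weights. Conceptually, and assuming the delta is a plain per-parameter difference, the result is an element-wise sum. The sketch below shows that idea on toy data with plain Python lists; `apply_delta` here is a hypothetical helper for illustration, not the actual conversion tool.

```python
from typing import Dict, List

StateDict = Dict[str, List[float]]


def apply_delta(base: StateDict, delta: StateDict) -> StateDict:
    """Return target weights where target[name] = base[name] + delta[name]."""
    if base.keys() != delta.keys():
        raise ValueError("base and delta must contain the same parameter names")
    target: StateDict = {}
    for name, base_values in base.items():
        delta_values = delta[name]
        if len(base_values) != len(delta_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise sum of the base weights and the released delta.
        target[name] = [b + d for b, d in zip(base_values, delta_values)]
    return target


# Toy stand-ins for real checkpoints (values are made up for illustration).
toy_base = {"layer.weight": [1.0, 2.0]}
toy_delta = {"layer.weight": [0.5, -1.0]}
toy_target = apply_delta(toy_base, toy_delta)
```

A real conversion has to hold the full-size tensors in memory while it does this, which is consistent with the large RAM requirement the guide reports.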
Maintenance & Community
- The project is associated with Vision-CAIR at King Abdullah University of Science and Technology.
- Links to a China-based discussion group on commercial AI applications are provided for community support.
Licensing & Compatibility
- The repository is licensed under the BSD 3-Clause license.
- Code is based on LAVIS, which is also released under the BSD 3-Clause license.
Limitations & Caveats
- Preparing Vicuna weights involves downloading and converting large files, which can be complex and resource-intensive.
- The conversion process for 13B weights requires approximately 80GB of RAM, potentially exceeding typical consumer hardware capabilities.
- Model quantization (8-bit loading) is enabled by default to reduce GPU memory use, which might affect model accuracy.
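To see why 8-bit loading matters, a back-of-the-envelope estimate of the memory taken by the weights alone is helpful. The sketch below is simple arithmetic, not a measurement: it assumes nominal parameter counts of 13 billion and 7 billion and counts only the bytes needed to store the weights, and the function name is hypothetical.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Lower bound on memory for the weights alone, in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # Nominal parameter counts (assumed): 13e9 for Vicuna-13B, 7e9 for Vicuna-7B.
    for label, params in (("Vicuna-13B", 13e9), ("Vicuna-7B", 7e9)):
        for dtype in ("fp16", "int8"):
            print(f"{label} {dtype}: {weight_memory_gb(params, dtype):.1f} GB")
```

Under these assumptions, 8-bit loading halves the weight footprint relative to fp16. The ~23GB and ~11.5GB figures quoted in this guide are higher than the 8-bit bound because they also have to cover everything else held on the GPU during inference.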