Visual autoregressive model for multimodal tasks
Top 97.3% on SourcePulse
VARGPT-v1.1 is a multimodal large language model designed for unified visual understanding and generation tasks, including image captioning, visual question answering (VQA), text-to-image generation, and visual editing. It targets researchers and developers working with multimodal AI, offering improved capabilities over its predecessor through advanced training techniques and an upgraded architecture.
How It Works
VARGPT-v1.1 employs a training strategy that combines iterative visual instruction tuning with reinforcement learning via Direct Preference Optimization (DPO). Training leverages an expanded corpus of 8.3 million visual-generative instruction pairs and an upgraded Qwen2 language backbone. The architecture produces outputs autoregressively, which lets a single model handle both multimodal understanding and image generation.
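The README does not show how the DPO stage is implemented, so the following is only a generic sketch of the standard DPO objective over sequence log-probabilities; the names (dpo_loss, beta, the log-prob tensors) are illustrative assumptions, not VARGPT internals.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective over summed sequence log-probabilities.

    Each tensor holds one value per preference pair: the log-probability of
    the preferred ("chosen") or dispreferred ("rejected") response under the
    trainable policy or the frozen reference model.
    """
    # Implicit reward = beta * log-ratio between policy and reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In practice the reference log-probabilities are computed once with a frozen copy of the instruction-tuned model, while the policy log-probabilities are recomputed each step for the model being optimized.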
Quick Start & Requirements
pip3 install -r requirements.txt
Run the command from the VARGPT-family-training directory to set up training, or from inference_v1_1 to set up inference.
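Because the inference code release was still listed as pending (see Limitations below), there is no documented Python entry point to copy here. Purely as a hedged sketch, the snippet below assumes the released checkpoint can be loaded through Hugging Face transformers with trust_remote_code; the repository id, processor call, and generation arguments are placeholders, and the scripts under inference_v1_1 remain the authoritative path.

```python
# Hypothetical sketch only: the repository id, processor behaviour, and
# generation arguments below are assumptions, not a documented VARGPT-v1.1 API.
# Prefer the scripts shipped under inference_v1_1 once they are released.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "VARGPT-family/VARGPT-v1.1"  # placeholder id; check the project page

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval()

# Example: visual question answering over a local image.
image = Image.open("example.jpg")
inputs = processor(text="Describe this image.", images=image, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```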
Highlighted Details
Maintenance & Community
The project is associated with Peking University and The Chinese University of Hong Kong. Key components are built upon established projects like LLaVA, Qwen2-VL, and LLaMA-Factory. Further community interaction details are not explicitly provided in the README.
Licensing & Compatibility
The project is licensed under the Apache License 2.0, which permits commercial use and modification.
Limitations & Caveats
The README indicates that the inference code release was still pending at the time of writing. The model targets a broad range of tasks, but the README points to evaluation frameworks rather than reporting detailed benchmark numbers for every capability.
Last updated: 4 months ago. Activity status: Inactive.