VARGPT-v1.1 by VARGPT-family

Visual autoregressive model for multimodal tasks

created 4 months ago
261 stars

Top 97.3% on SourcePulse

View on GitHub
Project Summary

VARGPT-v1.1 is a multimodal large language model designed for unified visual understanding and generation tasks, including image captioning, visual question answering (VQA), text-to-image generation, and visual editing. It targets researchers and developers working with multimodal AI, offering improved capabilities over its predecessor through advanced training techniques and an upgraded architecture.

How It Works

VARGPT-v1.1 employs a training strategy that combines iterative visual instruction tuning with reinforcement learning via Direct Preference Optimization (DPO). Training draws on an expanded corpus of 8.3 million visual-generative instruction pairs and an upgraded Qwen2 language backbone. Because the architecture is autoregressive throughout, a single model can consume and produce diverse multimodal inputs and outputs.
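As a minimal illustration of the DPO objective mentioned above (a sketch, not VARGPT's actual training code; the function name and the scalar summed log-probability inputs are assumed for clarity):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given summed log-probabilities
    of the chosen and rejected responses under the trained policy and
    under a frozen reference model."""
    # implicit reward = log-ratio of policy vs. reference probability
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)), written stably as log(1 + e^{-margin})
    return math.log1p(math.exp(-margin))
```

Minimizing this loss pushes the margin up, i.e. makes the preferred generation relatively more likely under the policy than under the reference model, without training an explicit reward model.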

Quick Start & Requirements

  • Install: pip3 install -r requirements.txt (within the VARGPT-family-training directory for training, or inference_v1_1 for inference).
  • Prerequisites: Python 3.9+, PyTorch 2.1.0, CUDA 12.0+. Flash Attention 2.7.3 is recommended for performance.
  • Resources: Requires significant GPU resources for training and inference. Model checkpoints are available on Hugging Face.
  • Links: Technical Report, Webpage, Hugging Face Models, Training Code.
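Following the bullets above, a typical inference setup might look like the sketch below (the repository URL is assumed from the project and organization names; verify the directory layout against the actual repo):

```shell
# clone the repository (URL assumed from project/org name)
git clone https://github.com/VARGPT-family/VARGPT-v1.1.git
cd VARGPT-v1.1

# install inference dependencies
cd inference_v1_1
pip3 install -r requirements.txt

# optional: Flash Attention, recommended by the README for performance
pip3 install flash-attn==2.7.3
```

For training, run the same `pip3 install -r requirements.txt` step inside the VARGPT-family-training directory instead.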

Highlighted Details

  • Achieves emergent image editing capabilities without architectural modifications.
  • Utilizes an 8.3M instruction pair dataset for enhanced training.
  • Upgraded to Qwen2 for improved language understanding.
  • Supports higher image generation resolutions.

Maintenance & Community

The project is associated with Peking University and The Chinese University of Hong Kong. Key components are built upon established projects such as LLaVA, Qwen2-VL, and LLaMA-Factory. The README does not list community channels beyond the GitHub repository itself.

Licensing & Compatibility

The project is licensed under the Apache License 2.0, which permits commercial use and modification.

Limitations & Caveats

The README listed the inference code release as a pending task at the time of writing. And while the model supports a range of tasks, performance benchmarks are not reported for every capability beyond pointers to evaluation frameworks.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 2 stars in the last 30 days
