Visual autoregressive model for multimodal tasks
Top 97.3% on SourcePulse
VARGPT-v1.1 is a multimodal large language model designed for unified visual understanding and generation tasks, including image captioning, visual question answering (VQA), text-to-image generation, and visual editing. It targets researchers and developers working with multimodal AI, offering improved capabilities over its predecessor through advanced training techniques and an upgraded architecture.
How It Works
VARGPT-v1.1 employs a training strategy that combines iterative visual instruction tuning with reinforcement learning via Direct Preference Optimization (DPO). Training leverages an expanded corpus of 8.3 million visual-generative instruction pairs and an upgraded Qwen2 language backbone. The architecture produces outputs autoregressively, which lets a single model handle both multimodal understanding and image generation.
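The README does not show how the DPO stage is implemented, so the following is only a generic sketch of the standard DPO objective over sequence log-probabilities; the names (dpo_loss, beta, the log-prob tensors) are illustrative assumptions, not VARGPT internals.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective over summed sequence log-probabilities.

    Each tensor holds one value per preference pair: the log-probability of
    the preferred ("chosen") or dispreferred ("rejected") response under the
    trainable policy or the frozen reference model.
    """
    # Implicit reward = beta * log-ratio between policy and reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In practice the reference log-probabilities are computed once with a frozen copy of the instruction-tuned model, while the policy log-probabilities are recomputed each step for the model being optimized.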
Quick Start & Requirements
pip3 install -r requirements.txt
Run the command from the VARGPT-family-training directory to set up training, or from inference_v1_1 to set up inference.
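Because the inference code release was still listed as pending (see Limitations below), there is no documented Python entry point to copy here. Purely as a hedged sketch, the snippet below assumes the released checkpoint can be loaded through Hugging Face transformers with trust_remote_code; the repository id, processor call, and generation arguments are placeholders, and the scripts under inference_v1_1 remain the authoritative path.

```python
# Hypothetical sketch only: the repository id, processor behaviour, and
# generation arguments below are assumptions, not a documented VARGPT-v1.1 API.
# Prefer the scripts shipped under inference_v1_1 once they are released.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "VARGPT-family/VARGPT-v1.1"  # placeholder id; check the project page

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval()

# Example: visual question answering over a local image.
image = Image.open("example.jpg")
inputs = processor(text="Describe this image.", images=image, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```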
Highlighted Details
Maintenance & Community
The project is associated with Peking University and The Chinese University of Hong Kong. Key components are built upon established projects like LLaVA, Qwen2-VL, and LLaMA-Factory. Further community interaction details are not explicitly provided in the README.
Licensing & Compatibility
The project is licensed under the Apache License 2.0, which permits commercial use and modification.
Limitations & Caveats
The README indicates that the inference code release was still pending at the time of writing. The model targets a broad range of tasks, but the README points to evaluation frameworks rather than reporting detailed benchmark numbers for every capability.
Last updated: 4 months ago. Activity status: Inactive.