VARGPT-family
Visual autoregressive model for multimodal tasks
VARGPT-v1.1 is a multimodal large language model designed for unified visual understanding and generation tasks, including image captioning, visual question answering (VQA), text-to-image generation, and visual editing. It targets researchers and developers working with multimodal AI, offering improved capabilities over its predecessor through advanced training techniques and an upgraded architecture.
How It Works
VARGPT-v1.1 employs a training strategy that combines iterative visual instruction tuning with reinforcement learning via Direct Preference Optimization (DPO). Training leverages an expanded corpus of 8.3 million visual-generative instruction pairs and an upgraded Qwen2 language backbone. The model generates outputs autoregressively, which lets a single architecture serve both visual understanding and visual generation tasks.
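The project's exact DPO setup is not reproduced in this summary, but the objective named above is standard. The sketch below shows the generic DPO preference loss in PyTorch; the function name, argument shapes, and the beta value are illustrative assumptions, not code from the VARGPT repository.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Generic Direct Preference Optimization loss (Rafailov et al., 2023).

    Each tensor holds per-sequence log-probabilities (summed token
    log-probs) of the preferred ("chosen") and dispreferred ("rejected")
    responses under the trained policy and a frozen reference model.
    """
    # Log-ratio of policy vs. reference for each response.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # DPO pushes the chosen log-ratio above the rejected one by a margin
    # controlled by beta; minimizing -logsigmoid does exactly that.
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()
```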
Quick Start & Requirements
pip3 install -r requirements.txt (run within the VARGPT-family-training directory for training, or the inference_v1_1 directory for inference).
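As a rough illustration of driving such a checkpoint, the sketch below assumes a Hugging Face-style model and processor loaded with remote code; the model id, prompt format, and generation settings are assumptions, and the repository's inference_v1_1 scripts remain the authoritative entry point.

```python
# Hypothetical inference sketch, not the repo's official script.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

model_id = "VARGPT-family/VARGPT-v1.1"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

# Visual question answering / captioning style call.
image = Image.open("example.jpg")
prompt = "Describe this image."
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```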
Maintenance & Community
The project is associated with Peking University and The Chinese University of Hong Kong, and key components build on established projects such as LLaVA, Qwen2-VL, and LLaMA-Factory. The README does not document further community channels or support options.
Licensing & Compatibility
The project is licensed under the Apache License 2.0, which permits commercial use and modification.
Limitations & Caveats
The README indicates that the inference code release was still pending at the time of writing. While the model supports a range of tasks, the README mentions evaluation frameworks but does not report detailed performance benchmarks for every capability.