Research paper exploring RL for multimodal LLMs
Vision-R1 introduces a novel approach to enhance reasoning capabilities in Multimodal Large Language Models (MLLMs) by leveraging Reinforcement Learning (RL) with a cold-start initialization strategy. This method addresses the challenge of effectively incentivizing complex reasoning in MLLMs, offering a path to achieve high performance with smaller models. The project targets researchers and developers working on advanced MLLMs who aim to improve their models' reasoning abilities.
How It Works
Vision-R1 employs a two-stage training process. First, a cold-start dataset (Vision-R1-cold) is generated using existing MLLMs and a "modality bridging" technique to create high-quality multimodal Chain-of-Thought (CoT) data. Fine-tuning a base MLLM on this dataset produces Vision-R1-CI. Second, RL training, using Group Relative Policy Optimization (GRPO) with a Progressive Thinking Suppression Training (PTST) strategy, is applied to Vision-R1-CI. PTST progressively relaxes length restrictions on generated reasoning during training, letting the model refine its reasoning process while avoiding both overthinking and collapsing onto shorter, less effective reasoning chains.
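The staged length schedule is the core idea behind PTST. The Python sketch below illustrates it under stated assumptions: the stage count, token caps, step counts, and the length-capped reward are illustrative placeholders, not the project's actual hyperparameters or API.

```python
# Minimal sketch of a PTST-style staged RL schedule. All numbers and the
# `ppo_update` hook are hypothetical; consult the Vision-R1 repo for the
# real training loop and settings.
from dataclasses import dataclass

@dataclass
class PTSTStage:
    max_new_tokens: int  # cap on generated reasoning length for this stage
    rl_steps: int        # RL optimization steps to run at this cap

# Illustrative schedule: the cap grows across stages, so the model first
# learns concise, correct reasoning before longer chains are permitted.
SCHEDULE = [
    PTSTStage(max_new_tokens=4_096, rl_steps=100),
    PTSTStage(max_new_tokens=8_192, rl_steps=100),
    PTSTStage(max_new_tokens=16_384, rl_steps=100),
]

def length_capped_reward(response: str, correct: bool, cap: int) -> float:
    """Toy reward: correctness signal, zeroed if the response overruns the
    current stage's cap, so truncated over-long generations earn nothing."""
    if len(response.split()) > cap:  # crude token-count proxy
        return 0.0
    return 1.0 if correct else 0.0

def run_ptst(ppo_update) -> None:
    """Drive the RL trainer through the staged schedule. `ppo_update` is a
    placeholder callable performing one optimization step at a given cap."""
    for stage in SCHEDULE:
        for _ in range(stage.rl_steps):
            ppo_update(
                max_new_tokens=stage.max_new_tokens,
                reward_fn=lambda r, ok, cap=stage.max_new_tokens:
                    length_capped_reward(r, ok, cap),
            )
```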
Quick Start & Requirements
Training images must be placed in the llava_cot_images and mulberry_images directories, with the dataset configurations updated in dataset_info.json. Cold-start fine-tuning scripts are located in train/cold_start. Dependencies are installed with pip install -r requirements.txt; Flash Attention 2 can be optionally installed. Inference can be performed using Hugging Face Transformers or vLLM.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats