Osilly/Vision-R1: Research paper exploring RL for multimodal LLMs
Top 47.7% on SourcePulse
Vision-R1 introduces a novel approach to enhance reasoning capabilities in Multimodal Large Language Models (MLLMs) by leveraging Reinforcement Learning (RL) with a cold-start initialization strategy. This method addresses the challenge of effectively incentivizing complex reasoning in MLLMs, offering a path to achieve high performance with smaller models. The project targets researchers and developers working on advanced MLLMs who aim to improve their models' reasoning abilities.
How It Works
Vision-R1 employs a two-stage training process. First, a cold-start dataset (Vision-R1-cold) is generated from existing MLLMs via a "modality bridging" technique, yielding high-quality multimodal Chain-of-Thought (CoT) data; this dataset is used to initialize a base MLLM, producing Vision-R1-CI. Second, RL training is applied to Vision-R1-CI using Group Relative Policy Optimization (GRPO) with a hard-format result reward, following a Progressive Thinking Suppression Training (PTST) strategy: the cap on reasoning length starts tight and is relaxed in stages, letting the model refine its reasoning process without collapsing into overthinking or settling for short, ineffective reasoning chains.
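As a rough illustration of these two ingredients, the sketch below pairs a stage-wise length schedule with a hard-format reward and GRPO-style group-normalized advantages. The stage counts, token caps, tag format, reward values, and whitespace-based token counting are all illustrative assumptions, not the repository's actual hyperparameters.

```python
import re
import statistics
from dataclasses import dataclass


@dataclass
class PTSTStage:
    max_cot_tokens: int  # cap on reasoning length during this stage
    rl_steps: int        # RL steps to run before relaxing the cap


# Assumed schedule: the cap on chain-of-thought length grows stage by
# stage, so the model masters short, well-formed reasoning before it
# is allowed to produce longer chains.
SCHEDULE = [
    PTSTStage(max_cot_tokens=4_096, rl_steps=200),
    PTSTStage(max_cot_tokens=8_192, rl_steps=200),
    PTSTStage(max_cot_tokens=16_384, rl_steps=200),
]


def hard_format_reward(completion: str, gold: str, max_cot_tokens: int) -> float:
    """All-or-nothing reward: the sample must respect the stage's length
    cap, follow the assumed <think>...</think><answer>...</answer>
    format, and contain the correct final answer."""
    if len(completion.split()) > max_cot_tokens:  # crude token proxy
        return 0.0
    match = re.fullmatch(r"<think>.*</think>\s*<answer>(.*)</answer>",
                         completion, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == gold.strip() else 0.0


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO replaces a learned critic with a per-prompt baseline: each
    sampled completion's reward is normalized against the mean/std of
    its own sampling group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]
```

In the full pipeline these pieces would sit inside the RL loop: for each prompt, sample a group of completions under the current stage's length cap, score them with the reward, and update the policy using the group-normalized advantages.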
Quick Start & Requirements
- Prepare the llava_cot_images and mulberry_images directories, and update the dataset configurations in dataset_info.json.
- Cold-start training scripts live under train/cold_start.
- Install dependencies with pip install -r requirements.txt; Flash Attention 2 can optionally be installed.
- Inference can be performed using Hugging Face Transformers or vLLM (see the sketch below).
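For the Transformers route, something along these lines should work. The checkpoint id, model class, prompt wording, and file names below are assumptions; check the repository's README for the exact names.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Hypothetical hub id -- substitute the checkpoint listed in the repo.
MODEL_ID = "Osilly/Vision-R1-7B"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

image = Image.open("problem.png")  # placeholder input image
prompt = "Solve the problem in the image. Reason step by step."
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

# A generous max_new_tokens leaves room for the model's chain-of-thought.
out = model.generate(**inputs, max_new_tokens=2048)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```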
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats