Vision-R1 by Osilly

Research paper exploring RL for multimodal LLMs

created 6 months ago
661 stars

Top 51.8% on sourcepulse

Project Summary

Vision-R1 introduces a novel approach to enhance reasoning capabilities in Multimodal Large Language Models (MLLMs) by leveraging Reinforcement Learning (RL) with a cold-start initialization strategy. This method addresses the challenge of effectively incentivizing complex reasoning in MLLMs, offering a path to achieve high performance with smaller models. The project targets researchers and developers working on advanced MLLMs who aim to improve their models' reasoning abilities.

How It Works

Vision-R1 employs a two-stage training process. First, a cold-start dataset (Vision-R1-cold) is generated from existing MLLMs using a "modality bridging" technique that converts their outputs into high-quality multimodal Chain-of-Thought (CoT) data; this dataset initializes a base MLLM, yielding Vision-R1-CI. Second, RL training is applied to Vision-R1-CI using Group Relative Policy Optimization (GRPO) with a Progressive Thinking Suppression Training (PTST) strategy. PTST begins with a tight cap on generation length and progressively loosens it across stages, suppressing early "overthinking" while still allowing the model to learn longer, more effective reasoning chains.
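
To make the PTST schedule concrete, here is a minimal Python sketch of the staged loop, assuming a hypothetical train_grpo_stage helper that runs one GRPO round under a given completion-length cap; this illustrates the schedule only and is not the repo's training code:

```python
# Progressive Thinking Suppression Training (PTST) schedule sketch.
# Each stage reruns GRPO with a looser cap on reasoning length,
# following the 4K -> 8K -> 16K schedule described in the README.

STAGE_CAPS = [4_096, 8_192, 16_384]  # max completion tokens per stage

def train_ptst(model, dataset, train_grpo_stage):
    """train_grpo_stage is a hypothetical helper: one GRPO training
    round in which completions are truncated at max_completion_length,
    so overlong "overthinking" traces earn no reward early on."""
    for max_len in STAGE_CAPS:
        model = train_grpo_stage(model, dataset, max_completion_length=max_len)
    return model
```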

Quick Start & Requirements

  • Cold-start Data Preparation: Download the Vision-R1-cold dataset from Hugging Face; its images come from LLaVA-CoT-100k and Mulberry-SFT. Place the images in the llava_cot_images and mulberry_images directories and register the dataset in dataset_info.json.
  • Cold-start Training: Uses LLaMA-Factory and requires 8x 80GB GPUs. Training scripts are located in train/cold_start.
  • Inference: Install dependencies via pip install -r requirements.txt; Flash Attention 2 is optional. Inference can run through Hugging Face Transformers or vLLM (a Transformers sketch follows this list).
  • Resources: Training the cold-start phase requires significant GPU resources (8x 80GB GPUs).
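
For reference, a minimal Transformers inference sketch, assuming the released checkpoint follows the Qwen2.5-VL interface; the model ID Osilly/Vision-R1-7B and the image path are assumptions, not confirmed by the README:

```python
# Minimal sketch: Vision-R1 inference via Hugging Face Transformers,
# assuming a Qwen2.5-VL-style checkpoint. Requires transformers and
# qwen-vl-utils (pip install qwen-vl-utils).
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

MODEL_ID = "Osilly/Vision-R1-7B"  # assumed checkpoint name

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "question.png"},  # local image path
        {"type": "text", "text": "Solve the problem in the image step by step."},
    ],
}]

# Build the chat prompt and pack the image tensors.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)

# Generous generation budget to leave room for the reasoning chain.
out = model.generate(**inputs, max_new_tokens=4096)
new_tokens = out[:, inputs.input_ids.shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```

For deployment, the README recommends vLLM instead, which serves the same checkpoint with higher throughput.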

Highlighted Details

  • Vision-R1-7B achieves comparable performance to MLLMs with 70B+ parameters.
  • The PTST strategy progressively loosens context length restrictions (4K, 8K, 16K tokens).
  • RL training uses a hard formatting result reward function (HFRRF), which grants reward only when both the output format and the final answer are correct (see the sketch after this list).
  • Inference supports both Hugging Face Transformers and vLLM, with vLLM recommended for deployment.
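
A minimal sketch of what an HFRRF-style reward might look like, assuming <think>/<answer> output tags and exact-match answer checking (both are assumptions; the repo's actual template and matching rules may differ):

```python
import re

# Assumed output template: "<think>...</think><answer>...</answer>".
TEMPLATE = re.compile(r"^<think>.*?</think>\s*<answer>(.*?)</answer>$", re.DOTALL)

def hfrrf(completion: str, ground_truth: str) -> float:
    """Hard formatting result reward: all-or-nothing. Returns 1.0 only
    when the completion matches the required template AND the extracted
    answer equals the ground truth; any formatting slip earns 0.0."""
    match = TEMPLATE.match(completion.strip())
    if match is None:
        return 0.0  # malformed output: no partial credit
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0
```

The all-or-nothing design forces the policy to keep its reasoning inside the required structure before it can be rewarded for correctness.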

Maintenance & Community

  • The project is under active development, with plans to release a 72B Vision-R1 model and to scale training to 8 GPUs.
  • Key milestones include the release of the paper, inference code, cold-start dataset, and the 7B model in March/April 2025.

Licensing & Compatibility

  • The README does not explicitly state a license for the code, datasets, or weights; it only notes that they will be released.

Limitations & Caveats

  • The RL training phase is marked as "Coming soon."
  • The README indicates that the final Vision-R1 model did not undergo the third stage of PTST training.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 114 stars in the last 90 days
