Vision-R1 by Osilly

Research paper exploring RL for multimodal LLMs

created 6 months ago
661 stars

Top 51.8% on sourcepulse

Project Summary

Vision-R1 introduces a novel approach to enhance reasoning capabilities in Multimodal Large Language Models (MLLMs) by leveraging Reinforcement Learning (RL) with a cold-start initialization strategy. This method addresses the challenge of effectively incentivizing complex reasoning in MLLMs, offering a path to achieve high performance with smaller models. The project targets researchers and developers working on advanced MLLMs who aim to improve their models' reasoning abilities.

How It Works

Vision-R1 employs a two-stage training process. First, a cold-start dataset (Vision-R1-cold) is generated from existing MLLMs using a "modality bridging" technique that converts their outputs into high-quality multimodal Chain-of-Thought (CoT) data; this dataset initializes a base MLLM, yielding Vision-R1-CI. Second, RL training is applied to Vision-R1-CI using Group Relative Policy Optimization (GRPO) with a Progressive Thinking Suppression Training (PTST) strategy. PTST begins with a tight cap on generation length and progressively loosens it across stages, suppressing early "overthinking" while still allowing the model to learn longer, more effective reasoning chains.
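
To make the PTST schedule concrete, here is a minimal Python sketch of the staged loop, assuming a hypothetical train_grpo_stage helper that runs one GRPO round under a given completion-length cap; this illustrates the schedule only and is not the repo's training code:

```python
# Progressive Thinking Suppression Training (PTST) schedule sketch.
# Each stage reruns GRPO with a looser cap on reasoning length,
# following the 4K -> 8K -> 16K schedule described in the README.

STAGE_CAPS = [4_096, 8_192, 16_384]  # max completion tokens per stage

def train_ptst(model, dataset, train_grpo_stage):
    """train_grpo_stage is a hypothetical helper: one GRPO training
    round in which completions are truncated at max_completion_length,
    so overlong "overthinking" traces earn no reward early on."""
    for max_len in STAGE_CAPS:
        model = train_grpo_stage(model, dataset, max_completion_length=max_len)
    return model
```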

Quick Start & Requirements

  • Cold-start Data Preparation: Download the Vision-R1-cold dataset from Hugging Face; its images come from LLaVA-CoT-100k and Mulberry-SFT. Place the images in the llava_cot_images and mulberry_images directories and register the dataset in dataset_info.json.
  • Cold-start Training: Uses LLaMA-Factory and requires 8x 80GB GPUs. Training scripts are located in train/cold_start.
  • Inference: Install dependencies via pip install -r requirements.txt; Flash Attention 2 is optional. Inference can run through Hugging Face Transformers or vLLM (a Transformers sketch follows this list).
  • Resources: Training the cold-start phase requires significant GPU resources (8x 80GB GPUs).
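
For reference, a minimal Transformers inference sketch, assuming the released checkpoint follows the Qwen2.5-VL interface; the model ID Osilly/Vision-R1-7B and the image path are assumptions, not confirmed by the README:

```python
# Minimal sketch: Vision-R1 inference via Hugging Face Transformers,
# assuming a Qwen2.5-VL-style checkpoint. Requires transformers and
# qwen-vl-utils (pip install qwen-vl-utils).
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

MODEL_ID = "Osilly/Vision-R1-7B"  # assumed checkpoint name

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "question.png"},  # local image path
        {"type": "text", "text": "Solve the problem in the image step by step."},
    ],
}]

# Build the chat prompt and pack the image tensors.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)

# Generous generation budget to leave room for the reasoning chain.
out = model.generate(**inputs, max_new_tokens=4096)
new_tokens = out[:, inputs.input_ids.shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```

For deployment, the README recommends vLLM instead, which serves the same checkpoint with higher throughput.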

Highlighted Details

  • Vision-R1-7B achieves comparable performance to MLLMs with 70B+ parameters.
  • The PTST strategy progressively loosens context length restrictions (4K, 8K, 16K tokens).
  • RL training uses a hard formatting result reward function (HFRRF), which grants reward only when both the output format and the final answer are correct (see the sketch after this list).
  • Inference supports both Hugging Face Transformers and vLLM, with vLLM recommended for deployment.
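
A minimal sketch of what an HFRRF-style reward might look like, assuming <think>/<answer> output tags and exact-match answer checking (both are assumptions; the repo's actual template and matching rules may differ):

```python
import re

# Assumed output template: "<think>...</think><answer>...</answer>".
TEMPLATE = re.compile(r"^<think>.*?</think>\s*<answer>(.*?)</answer>$", re.DOTALL)

def hfrrf(completion: str, ground_truth: str) -> float:
    """Hard formatting result reward: all-or-nothing. Returns 1.0 only
    when the completion matches the required template AND the extracted
    answer equals the ground truth; any formatting slip earns 0.0."""
    match = TEMPLATE.match(completion.strip())
    if match is None:
        return 0.0  # malformed output: no partial credit
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0
```

The all-or-nothing design forces the policy to keep its reasoning inside the required structure before it can be rewarded for correctness.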

Maintenance & Community

  • The project is under active development, with plans to release a 72B Vision-R1 model and to scale training to 8 GPUs.
  • Key milestones include the release of the paper, inference code, cold-start dataset, and the 7B model in March/April 2025.

Licensing & Compatibility

  • The README does not explicitly state a license for the code, datasets, or weights; it only notes that they will be released.

Limitations & Caveats

  • The RL training phase is marked as "Coming soon."
  • The README indicates that the final Vision-R1 model did not undergo the third stage of PTST training.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 114 stars in the last 90 days
