r1_reward by yfzhang114

Multimodal reward modeling via stable reinforcement learning

Created 4 months ago
260 stars

Top 97.7% on SourcePulse

View on GitHub
Project Summary

R1-Reward addresses the challenge of training robust multimodal reward models by introducing the StableReinforce algorithm and a corresponding model. It targets researchers and developers working on multimodal large language models (MLLMs) and reinforcement learning from human feedback (RLHF), offering improved performance on established benchmarks and a novel, stable training approach.

How It Works

R1-Reward is trained with StableReinforce, a reinforcement learning method that extends the Reinforce++ algorithm. StableReinforce targets three sources of instability: the training loss, advantage estimation, and the design of the reward function. Stabilizing these core RL components enables more effective training of multimodal reward models and yields significant gains when evaluating multimodal outputs.
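
The repository's training code is authoritative; as a minimal, illustrative sketch (not the authors' implementation), two stabilizations of the kind described above could look like clipping the log-probability ratio before exponentiation and discarding outlier advantages:

```python
import torch

def stable_policy_loss(logp_new, logp_old, advantages,
                       log_ratio_clip=2.0, ppo_clip=0.2, adv_sigma=3.0):
    """Illustrative sketch of a Reinforce++/PPO-style loss with two
    stabilizations in the spirit of what the summary describes
    (hypothetical code, not taken from the R1-Reward repository).
    All tensors are 1-D with one entry per sampled sequence."""
    # Clip the log ratio *before* exponentiation so exp() cannot blow up
    # on rare, badly mismatched samples (loss stability).
    log_ratio = torch.clamp(logp_new - logp_old, -log_ratio_clip, log_ratio_clip)
    ratio = torch.exp(log_ratio)

    # Normalize advantages, then mask out extreme outliers
    # (advantage-estimation stability).
    adv = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    mask = (adv.abs() <= adv_sigma).float()

    # Standard clipped surrogate objective, averaged over the kept samples.
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - ppo_clip, 1.0 + ppo_clip) * adv
    loss = -(torch.min(unclipped, clipped) * mask).sum() / mask.sum().clamp(min=1.0)
    return loss
```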

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment (conda create -n r1_reward python=3.10 -y), activate it (conda activate r1_reward), and install dependencies (pip install -e .[vllm], pip install flash_attn --no-build-isolation).
  • Prerequisites: Python 3.10; CUDA and a GPU are implied by the flash_attn and vllm dependencies and are effectively required for efficient training and inference.
  • Data Preparation: Requires downloading the R1-Reward RL training dataset and updating it so that image paths point to the correct local locations. Training data must follow a specific JSON structure (see the sketch after this list).
  • External Reward Model: For Consistency Reward calculation, an external model's API endpoint needs to be configured in openrlhf/models/remote_rm/math_verifier_mllm.py.
  • Evaluation Data: Benchmark datasets (VL Reward-Bench, MM-RLHF Reward-Bench, Multimodal Reward Bench) need to be downloaded separately, and image/video paths within provided data files must be updated.
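
The dataset's actual schema is defined by the repository; as a purely hypothetical sketch of the kind of path fix-up the data preparation step calls for (the field name `image` and the file names below are assumptions, not the real schema):

```python
import json
from pathlib import Path

# Hypothetical sketch: rewrite relative image paths in the RL training JSON
# to absolute local paths. Field names and file layout are assumptions and
# may differ from the schema shipped with the R1-Reward dataset.
IMAGE_ROOT = Path("/data/r1_reward/images")

with open("r1_reward_train.json") as f:
    records = json.load(f)

for record in records:
    # Assumed field name "image"; adjust to the dataset's real structure.
    record["image"] = str(IMAGE_ROOT / record["image"])

with open("r1_reward_train_local.json", "w") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```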

Highlighted Details

  • Achieves state-of-the-art performance, with reported improvements of 13.5% on VL Reward-Bench (Voting@15), 3.5% on MM-RLHF Reward-Bench, and 14.6% on Multimodal Reward Bench.
  • Provides the R1-Reward model, training dataset, and inference code for multiple reward benchmarks.
  • Includes a detailed Python usage example for loading the model and performing inference with image or video inputs.

Maintenance & Community

The project is associated with authors Yi-Fan Zhang, Xingyu Lu, and others, as indicated by the arXiv paper citation. Related projects like MM-RLHF, MME-RealWorld, MME-Survey, Beyond LLaVA-HD, and VITA-1.5 are also mentioned, suggesting an active research group.

Licensing & Compatibility

The repository's license is not explicitly stated in the provided README.

Limitations & Caveats

The README implies that users must manage local image/video data themselves and update file paths for both training and evaluation. Calculating the Consistency Reward requires configuring an external model's API endpoint, which adds setup complexity. The project carries a 2025 arXiv date, suggesting it is recent or pre-publication.

Health Check

  • Last Commit: 4 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 13 stars in the last 30 days
