r1_reward by yfzhang114

Multimodal reward modeling via stable reinforcement learning

Created 4 months ago
260 stars

Top 97.7% on SourcePulse

View on GitHub
Project Summary

R1-Reward addresses the challenge of training robust multimodal reward models by introducing the StableReinforce algorithm and a corresponding model. It targets researchers and developers working on multimodal large language models (MLLMs) and reinforcement learning from human feedback (RLHF), offering improved performance on established benchmarks and a novel, stable training approach.

How It Works

R1-Reward is trained with StableReinforce, a reinforcement learning method that extends the Reinforce++ algorithm. StableReinforce targets three sources of instability: the training loss, advantage estimation, and the design of the reward function. Stabilizing these core RL components enables more effective training of multimodal reward models and yields significant gains when evaluating multimodal outputs.
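
The repository's training code is authoritative; as a minimal, illustrative sketch (not the authors' implementation), two stabilizations of the kind described above could look like clipping the log-probability ratio before exponentiation and discarding outlier advantages:

```python
import torch

def stable_policy_loss(logp_new, logp_old, advantages,
                       log_ratio_clip=2.0, ppo_clip=0.2, adv_sigma=3.0):
    """Illustrative sketch of a Reinforce++/PPO-style loss with two
    stabilizations in the spirit of what the summary describes
    (hypothetical code, not taken from the R1-Reward repository).
    All tensors are 1-D with one entry per sampled sequence."""
    # Clip the log ratio *before* exponentiation so exp() cannot blow up
    # on rare, badly mismatched samples (loss stability).
    log_ratio = torch.clamp(logp_new - logp_old, -log_ratio_clip, log_ratio_clip)
    ratio = torch.exp(log_ratio)

    # Normalize advantages, then mask out extreme outliers
    # (advantage-estimation stability).
    adv = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    mask = (adv.abs() <= adv_sigma).float()

    # Standard clipped surrogate objective, averaged over the kept samples.
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - ppo_clip, 1.0 + ppo_clip) * adv
    loss = -(torch.min(unclipped, clipped) * mask).sum() / mask.sum().clamp(min=1.0)
    return loss
```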

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment (conda create -n r1_reward python=3.10 -y), activate it (conda activate r1_reward), and install dependencies (pip install -e .[vllm], pip install flash_attn --no-build-isolation).
  • Prerequisites: Python 3.10; CUDA and a GPU are implied by the flash_attn and vllm dependencies and are effectively required for efficient training and inference.
  • Data Preparation: Requires downloading the R1-Reward RL training dataset and updating it so that image paths point to the correct local locations. Training data must follow a specific JSON structure (see the sketch after this list).
  • External Reward Model: For Consistency Reward calculation, an external model's API endpoint needs to be configured in openrlhf/models/remote_rm/math_verifier_mllm.py.
  • Evaluation Data: Benchmark datasets (VL Reward-Bench, MM-RLHF Reward-Bench, Multimodal Reward Bench) need to be downloaded separately, and image/video paths within provided data files must be updated.
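
The dataset's actual schema is defined by the repository; as a purely hypothetical sketch of the kind of path fix-up the data preparation step calls for (the field name `image` and the file names below are assumptions, not the real schema):

```python
import json
from pathlib import Path

# Hypothetical sketch: rewrite relative image paths in the RL training JSON
# to absolute local paths. Field names and file layout are assumptions and
# may differ from the schema shipped with the R1-Reward dataset.
IMAGE_ROOT = Path("/data/r1_reward/images")

with open("r1_reward_train.json") as f:
    records = json.load(f)

for record in records:
    # Assumed field name "image"; adjust to the dataset's real structure.
    record["image"] = str(IMAGE_ROOT / record["image"])

with open("r1_reward_train_local.json", "w") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```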

Highlighted Details

  • Achieves state-of-the-art performance, with reported improvements of 13.5% on VL Reward-Bench (Voting@15), 3.5% on MM-RLHF Reward-Bench, and 14.6% on Multimodal Reward Bench.
  • Provides the R1-Reward model, training dataset, and inference code for multiple reward benchmarks.
  • Includes a detailed Python usage example for loading the model and performing inference with image or video inputs.

Maintenance & Community

The project is associated with authors Yi-Fan Zhang, Xingyu Lu, and others, as indicated by the arXiv paper citation. Related projects like MM-RLHF, MME-RealWorld, MME-Survey, Beyond LLaVA-HD, and VITA-1.5 are also mentioned, suggesting an active research group.

Licensing & Compatibility

The repository's license is not explicitly stated in the provided README.

Limitations & Caveats

The README implies that users must manage local image/video data themselves and update file paths for both training and evaluation. Calculating the Consistency Reward requires configuring an external model's API endpoint, which adds setup complexity. The project carries a 2025 arXiv date, suggesting it is recent or pre-publication.

Health Check

  • Last Commit: 4 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 13 stars in the last 30 days
