Multimodal reward modeling via stable reinforcement learning
R1-Reward addresses the challenge of training robust multimodal reward models by introducing the StableReinforce algorithm and a reward model trained with it. It targets researchers and developers working on multimodal large language models (MLLMs) and reinforcement learning from human feedback (RLHF), offering improved performance on established benchmarks and a novel, stable training approach.
How It Works
R1-Reward is trained with StableReinforce, a reinforcement learning method that enhances the Reinforce++ algorithm. The approach targets three core RL components: training-loss stability, advantage estimation, and reward function design. By stabilizing these components, StableReinforce enables more effective training of multimodal reward models, leading to significant performance gains in evaluating multimodal outputs.
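The paragraph above names the goals but not the mechanisms. As a rough illustration only, here is a minimal PyTorch sketch combining two common stabilization tricks consistent with those goals: clipping the log-probability ratio before exponentiation (so the importance weight cannot overflow) and discarding outlier advantages after standardization. Whether these match StableReinforce's exact mechanisms should be verified against the paper; all names and thresholds here are illustrative assumptions, not the project's actual API.

```python
import torch

def stable_policy_loss(log_probs, old_log_probs, advantages,
                       eps=0.2, log_ratio_clip=10.0, adv_sigma=3.0):
    """Illustrative sketch of a stabilized PPO-style policy loss.

    Two stabilization ideas (assumed, not confirmed from the repo):
    - clip the log-ratio *before* exponentiating, so exp() cannot overflow;
    - drop samples whose standardized advantage is an extreme outlier.
    """
    # Clip in log space before exponentiating (numerical stability).
    log_ratio = torch.clamp(log_probs - old_log_probs,
                            -log_ratio_clip, log_ratio_clip)
    ratio = torch.exp(log_ratio)

    # Standardize advantages, then mask out extreme outliers.
    adv = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    mask = (adv.abs() <= adv_sigma).float()

    # Standard PPO-style clipped surrogate on the surviving samples.
    surr1 = ratio * adv
    surr2 = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    loss = -(torch.min(surr1, surr2) * mask).sum() / mask.sum().clamp(min=1.0)
    return loss
```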
Quick Start & Requirements
Set up the environment by creating a conda environment (`conda create -n r1_reward python=3.10 -y`), activating it (`conda activate r1_reward`), and installing dependencies (`pip install -e .[vllm]`, then `pip install flash_attn --no-build-isolation`). Requirements include key packages (`flash_attn` and `vllm`) and potentially a GPU for efficient training and inference. The remote reward verifier referenced in the README is `openrlhf/models/remote_rm/math_verifier_mllm.py`.
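Since the verifier lives under `openrlhf/models/remote_rm/`, training presumably follows OpenRLHF's remote reward model pattern, in which a standalone server scores rollouts over HTTP. The route and payload in this sketch are assumptions based on common OpenRLHF conventions, not this project's documented interface; check `math_verifier_mllm.py` for the real one.

```python
import requests

# Hypothetical client for an OpenRLHF-style remote reward server.
# URL, route, and JSON schema are assumptions; verify against
# openrlhf/models/remote_rm/math_verifier_mllm.py.
response = requests.post(
    "http://localhost:5000/get_reward",
    json={"query": ["<prompt plus candidate response text>"]},
)
print(response.json())  # expected shape: {"rewards": [<float>, ...]}
```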
Highlighted Details
Maintenance & Community
The project is associated with authors Yi-Fan Zhang, Xingyu Lu, and others, as indicated by the arXiv paper citation. Related projects like MM-RLHF, MME-RealWorld, MME-Survey, Beyond LLaVA-HD, and VITA-1.5 are also mentioned, suggesting an active research group.
Licensing & Compatibility
The repository's license is not explicitly stated in the provided README.
Limitations & Caveats
The README implies that users must obtain local image/video data themselves and set its paths correctly for both training and evaluation. The setup for calculating the Consistency Reward requires configuring an external API endpoint (see the sketch below), which may add complexity. The project carries a 2025 arXiv date, suggesting it is recent or pre-publication.
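For context on the external-API caveat: a consistency reward of this kind typically means calling a judge model through an OpenAI-compatible chat endpoint to check that a final answer follows from its reasoning. The sketch below is purely illustrative; the endpoint URL, environment variables, model name, and prompt are all hypothetical, not the project's actual configuration.

```python
import os
import requests

# Hypothetical judge endpoint and credentials (not the project's config).
JUDGE_URL = os.environ.get("JUDGE_API_URL",
                           "https://api.example.com/v1/chat/completions")
JUDGE_KEY = os.environ["JUDGE_API_KEY"]

def consistency_reward(reasoning: str, answer: str) -> float:
    """Ask an external judge whether the answer follows from the reasoning."""
    payload = {
        "model": "judge-model",  # hypothetical model name
        "messages": [{
            "role": "user",
            "content": (f"Reasoning:\n{reasoning}\n\nAnswer: {answer}\n"
                        "Is the answer consistent with the reasoning? "
                        "Reply YES or NO."),
        }],
    }
    r = requests.post(JUDGE_URL, json=payload,
                      headers={"Authorization": f"Bearer {JUDGE_KEY}"})
    text = r.json()["choices"][0]["message"]["content"]
    return 1.0 if "YES" in text.upper() else 0.0
```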
Last updated 4 months ago; the repository is currently marked inactive.