VLM research for reinforcing generalization with minimal cost
R1-V is an open-source framework for reinforcement learning with verifiable rewards (RLVR) in Vision-Language Models (VLMs), aiming to enhance generalization capabilities at minimal cost. It targets researchers and developers working on visual agents and general vision-language intelligence, offering improved algorithmic efficiency and task diversity.
How It Works
The project applies reinforcement learning with verifiable rewards, specifically GRPO (Group Relative Policy Optimization), to fine-tune VLMs. For each prompt the policy samples a group of completions, scores them against verifiable targets, and is updated toward completions that outperform their group, improving the model's ability to generalize across visual reasoning tasks and potentially yielding more robust and capable models.
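To make this concrete, here is a minimal sketch of the group-relative scoring at the heart of GRPO, using a toy exact-match reward as a stand-in for R1-V's actual reward functions; the repository's trainer is more elaborate, but the idea of replacing a learned critic with group statistics is the same.

```python
# Minimal sketch of GRPO's group-relative scoring with a toy verifiable reward.
# R1-V's real reward functions and training loop live in the repository; this
# only illustrates scoring completions against their siblings for one prompt.
from typing import List

def verifiable_reward(completion: str, answer: str) -> float:
    """Return 1.0 if the completion exactly matches the ground-truth answer, else 0.0."""
    return 1.0 if completion.strip() == answer.strip() else 0.0

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Score each completion relative to the group sampled for the same prompt."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one visual counting question whose ground truth is "3".
completions = ["3", "2", "3", "4"]
rewards = [verifiable_reward(c, "3") for c in completions]
print(group_relative_advantages(rewards))  # correct completions get positive advantage
```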
Quick Start & Requirements
Create a conda environment (conda create -n r1-v python=3.11) and activate it, then run bash setup.sh. Ensure your environment aligns with ./src/requirements.txt. Key dependencies include vllm==0.7.2 (for accelerated training), deepspeed, wandb, and flash_attention_2.
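Before training, the listed dependencies can be sanity-checked with a short script like the one below. This helper is a hypothetical addition, not part of the repository, and assumes the PyPI distribution names shown (flash_attention_2 is provided by the flash-attn package).

```python
# Hypothetical environment check for the key dependencies listed above.
from importlib.metadata import version, PackageNotFoundError

requirements = {
    "vllm": "0.7.2",      # pinned version used for accelerated training
    "deepspeed": None,    # no specific pin noted
    "wandb": None,
    "flash-attn": None,   # backs the flash_attention_2 implementation
}

for package, pinned in requirements.items():
    try:
        installed = version(package)
    except PackageNotFoundError:
        print(f"MISSING   {package}")
        continue
    if pinned is not None and installed != pinned:
        print(f"MISMATCH  {package}=={installed} (expected {pinned})")
    else:
        print(f"OK        {package}=={installed}")
```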
Highlighted Details
Maintenance & Community
The project is actively maintained with recent updates in February 2025, including support for new models and bug fixes. The team welcomes community contributions and ideas, particularly for issues marked "help wanted."
Licensing & Compatibility
The repository does not explicitly state a license in the README. However, it acknowledges contributions from various projects with different licenses (e.g., Apache 2.0 for DeepSeek, MIT for QwenVL). Users should verify licensing for commercial use.
Limitations & Caveats
A bug related to batched training was noted, with a recommendation to use per_device_train_batch_size=1 to reproduce reported results. OOM errors can occur; reducing --num_generations or enabling vLLM for generation helps. The project also notes that enforcing Chain-of-Thought reasoning may be detrimental to smaller models.
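As an illustration only, the sketch below assembles a launch command that applies these workarounds. The torchrun wrapper, GPU count, and script path are assumptions (check the repository for the real training entry point); the two flags come directly from the caveats above.

```python
# Illustrative only: build conservative launch overrides reflecting the caveats above.
import shlex

TRAIN_SCRIPT = "src/open_r1/grpo.py"  # placeholder path; see the repository for the real entry point

overrides = {
    "--per_device_train_batch_size": "1",  # workaround for the reported batched-training bug
    "--num_generations": "4",              # example reduced value; lower further if OOM persists
}

cmd = ["torchrun", "--nproc_per_node", "8", TRAIN_SCRIPT]  # adjust to your GPU count
for flag, value in overrides.items():
    cmd += [flag, value]

print(shlex.join(cmd))  # inspect the full command before launching
```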