VLM-R1 by om-ai-lab

Visual understanding via reinforced VLMs

Created 7 months ago
5,540 stars

Top 9.2% on SourcePulse

Project Summary

VLM-R1 offers a stable and generalizable R1-style Large Vision-Language Model (VLM) framework, primarily targeting researchers and developers working on visual understanding tasks like Referring Expression Comprehension (REC), Open-Vocabulary Detection (OVD), and multimodal math reasoning. It provides a robust approach to enhance model generalization, particularly on out-of-domain data, by leveraging reinforcement learning (GRPO) over supervised fine-tuning (SFT).

How It Works

VLM-R1 implements the GRPO (Group Relative Policy Optimization) algorithm, which has demonstrated superior out-of-domain generalization compared to traditional SFT methods. The framework supports training with frozen vision modules or full fine-tuning, and offers LoRA fine-tuning for efficiency. It is designed to be flexible, allowing integration of various VLMs such as Qwen2.5-VL and InternVL, and supports custom reward functions for specialized tasks.
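The two ingredients above (a verifiable task reward and group-relative normalization) can be sketched as follows. This is a hypothetical illustration: the function names and the [x1, y1, x2, y2] box format are assumptions, not VLM-R1's actual API.

```python
import statistics

# Hypothetical sketch of GRPO-style training signals for an REC task.
# Names and the box format are illustrative, not VLM-R1's actual API.

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def rec_reward(pred_box, gt_box, threshold=0.5):
    """Verifiable reward: 1.0 if the predicted box overlaps ground truth
    at or above the IoU threshold, else 0.0."""
    return 1.0 if iou(pred_box, gt_box) >= threshold else 0.0

def group_relative_advantages(rewards):
    """GRPO's core idea: score each sampled completion relative to the
    mean and standard deviation of its own sampling group, instead of
    training a separate value/critic model."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]
```

For a group of sampled answers with rewards like [1.0, 0.0, 0.0, 1.0], the correct completions receive positive advantages and the incorrect ones negative, which is the policy-gradient signal GRPO trains on.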

Quick Start & Requirements

  • Install: Use conda create -n vlm-r1 python=3.10 and conda activate vlm-r1, then run bash setup.sh.
  • Prerequisites: Python 3.10, PyTorch, CUDA (implied for training), specific datasets (COCO Train2014, RefCOCO/+/g, LISA-Grounding) and their annotations.
  • Setup: Requires downloading and unzipping datasets. Training scripts are provided for GRPO and SFT (using LLaMA-Factory).
  • Links: REC Demo, OVD Demo, Tech Report.
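Assuming a Linux shell with conda available and the repository already cloned, the install steps above look like this (commands taken from the Quick Start; run from the repository root):

```shell
# Create and activate the project environment (Python 3.10 as required)
conda create -n vlm-r1 python=3.10 -y
conda activate vlm-r1

# Install dependencies via the repository's setup script
bash setup.sh
```

Dataset downloads (COCO Train2014, RefCOCO/+/g, LISA-Grounding) and their annotations are a separate manual step before training.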

Highlighted Details

  • Achieved SOTA on OVDEval and Top1 on OpenCompass Math Leaderboard (<4B parameters).
  • Demonstrates superior out-of-domain generalization for REC tasks compared to SFT.
  • Supports multi-node training, LoRA fine-tuning, and multi-image inputs.
  • Integrates with LLaMA-Factory for SFT training.

Maintenance & Community

The project is actively updated, with recent additions including improved logging, custom reward functions, and support for InternVL. Community interaction is encouraged via GitHub issues and pull requests.

Licensing & Compatibility

The repository appears to be distributed under a permissive license, though individual dependencies may carry other terms. Permissive licenses generally allow commercial use, but users should verify the license of each component before deploying.

Limitations & Caveats

The setup requires significant data preparation and understanding of the GRPO and SFT training paradigms. While supporting multiple VLMs, adding new models requires following a specific guide. The project is research-oriented, and production readiness may require further validation.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 2
  • Issues (30d): 10
  • Star History: 86 stars in the last 30 days

Explore Similar Projects

Starred by Vincent Weisser (Cofounder of Prime Intellect), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 4 more.

simpleRL-reason by hkust-nlp

Top 0.1% on SourcePulse
4k stars
RL recipe for reasoning ability in models
Created 7 months ago
Updated 1 month ago