VLM-R1  by om-ai-lab

A framework for visual understanding built on reinforced Vision-Language Models

created 5 months ago
5,387 stars

Top 9.5% on sourcepulse

Project Summary

VLM-R1 offers a stable and generalizable R1-style Large Vision-Language Model (VLM) framework, primarily targeting researchers and developers working on visual understanding tasks like Referring Expression Comprehension (REC), Open-Vocabulary Detection (OVD), and multimodal math reasoning. It provides a robust approach to enhance model generalization, particularly on out-of-domain data, by leveraging reinforcement learning (GRPO) over supervised fine-tuning (SFT).

How It Works

VLM-R1 implements the GRPO (Group Relative Policy Optimization) algorithm, which has demonstrated superior out-of-domain generalization compared to traditional SFT. The framework supports training with a frozen vision module or full fine-tuning, and offers LoRA fine-tuning for efficiency. It is designed to be flexible, allowing integration of various VLMs such as Qwen2.5-VL and InternVL, and supports custom reward functions for specialized tasks.
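The GRPO training signal can be illustrated with a minimal sketch: a task reward (here an IoU-based reward for REC, standing in for the framework's configurable reward functions) is normalized within a group of sampled completions to produce advantages. All names below are illustrative, not VLM-R1's actual API.

```python
import statistics

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def rec_reward(pred_box, gt_box, threshold=0.5):
    """Binary REC reward: 1.0 if the predicted box overlaps the
    ground truth above the IoU threshold, else 0.0 (a common choice;
    illustrative, not the repository's exact reward)."""
    return 1.0 if iou(pred_box, gt_box) >= threshold else 0.0

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each sampled completion's
    reward against its group's mean and standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# A group of 4 rollouts scored by the task reward, then normalized:
advantages = group_relative_advantages([0.9, 0.1, 0.5, 0.5])
```

Because advantages are computed relative to the group, GRPO needs no learned value model, which is part of what makes the setup comparatively lightweight.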

Quick Start & Requirements

  • Install: Use conda create -n vlm-r1 python=3.10 and conda activate vlm-r1, then run bash setup.sh.
  • Prerequisites: Python 3.10, PyTorch, CUDA (implied for training), specific datasets (COCO Train2014, RefCOCO/+/g, LISA-Grounding) and their annotations.
  • Setup: Requires downloading and unzipping datasets. Training scripts are provided for GRPO and SFT (using LLaMA-Factory).
  • Links: REC Demo, OVD Demo, Tech Report.
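The install steps above can be consolidated into a short setup script; this is a sketch assuming it is run from the cloned repository root, with dataset download locations left to the repository's data preparation docs:

```shell
# Create and activate the project environment (Python 3.10 per the prerequisites)
conda create -n vlm-r1 python=3.10 -y
conda activate vlm-r1

# Install the framework's dependencies
bash setup.sh

# Datasets (COCO Train2014, RefCOCO/+/g, LISA-Grounding) and their annotations
# must be downloaded and unzipped separately before training.
```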

Highlighted Details

  • Achieved SOTA on OVDEval and ranked Top 1 on the OpenCompass Math Leaderboard among models under 4B parameters.
  • Demonstrates superior out-of-domain generalization for REC tasks compared to SFT.
  • Supports multi-node training, LoRA fine-tuning, and multi-image inputs.
  • Integrates with LLaMA-Factory for SFT training.

Maintenance & Community

The project is actively updated, with recent additions including improved logging, custom reward functions, and support for InternVL. Community interaction is encouraged via GitHub issues and pull requests.

Licensing & Compatibility

The repository appears to be distributed under a permissive license, though specific dependencies may carry other terms. Users intending commercial use should verify the licenses of individual components.

Limitations & Caveats

The setup requires significant data preparation and understanding of the GRPO and SFT training paradigms. While supporting multiple VLMs, adding new models requires following a specific guide. The project is research-oriented, and production readiness may require further validation.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 9
  • Star history: 578 stars in the last 90 days

