VLM-R1: visual understanding with reinforced VLMs
Top 9.5% on sourcepulse
VLM-R1 offers a stable and generalizable R1-style Large Vision-Language Model (VLM) framework, primarily targeting researchers and developers working on visual understanding tasks like Referring Expression Comprehension (REC), Open-Vocabulary Detection (OVD), and multimodal math reasoning. It provides a robust approach to enhance model generalization, particularly on out-of-domain data, by leveraging reinforcement learning (GRPO) over supervised fine-tuning (SFT).
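As a concrete illustration of the REC task, the model receives an image plus a natural-language referring expression and must localize the referent, typically as a bounding box. A hypothetical sample is sketched below; the field names and the R1-style think/answer output format are illustrative assumptions, not the repository's exact templates.

```python
# Hypothetical REC sample; field names and formats are illustrative only.
sample = {
    "image": "demo/kitchen.jpg",                      # path to the input image
    "problem": "Locate the red mug on the counter.",  # referring expression
    "solution": [220, 145, 310, 240],                 # ground-truth box [x1, y1, x2, y2]
}
# An R1-style model is usually prompted to reason before answering, producing e.g.:
# "<think> the red mug sits left of the sink ... </think><answer>[220, 145, 310, 240]</answer>"
```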
How It Works
VLM-R1 implements the GRPO (Group Relative Policy Optimization) algorithm, which has demonstrated superior out-of-domain generalization compared to traditional SFT methods. The framework supports training with a frozen vision module or full fine-tuning, and offers LoRA fine-tuning for efficiency. It is designed to be flexible, allowing integration of various VLMs such as Qwen2.5-VL and InternVL, and supports custom reward functions for specialized tasks.
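At its core, GRPO samples a group of completions per prompt, scores each with a task reward, and normalizes the rewards within the group to obtain advantages, so no separate value model is needed. The sketch below illustrates that idea together with a hypothetical IoU-based reward for REC; the function names and reward details are assumptions for illustration, not VLM-R1's actual implementation.

```python
# Sketch of group-relative advantages (the idea behind GRPO) with an IoU-style
# reward for referring expression comprehension. Illustrative only.
from typing import List, Tuple
import torch

def iou_reward(pred: Tuple[float, float, float, float],
               gt: Tuple[float, float, float, float]) -> float:
    """Score a predicted box [x1, y1, x2, y2] by its IoU with the ground truth."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((pred[2] - pred[0]) * (pred[3] - pred[1])
             + (gt[2] - gt[0]) * (gt[3] - gt[1]) - inter)
    return inter / union if union > 0 else 0.0

def group_relative_advantages(rewards: List[float]) -> torch.Tensor:
    """advantage_i = (r_i - mean(r)) / (std(r) + eps), computed within one prompt's group."""
    r = torch.tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std(unbiased=False) + 1e-6)

# Four sampled completions (predicted boxes) for one referring expression:
preds = [(10, 10, 50, 60), (12, 11, 48, 58), (0, 0, 20, 20), (9, 12, 52, 61)]
gt = (10, 10, 50, 60)
rewards = [iou_reward(p, gt) for p in preds]
print(group_relative_advantages(rewards))  # high-IoU samples receive positive advantage
```

Custom reward functions slot into the same place: any callable that maps a completion (and its ground truth) to a scalar can drive the group-relative update.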
Quick Start & Requirements
Set up the environment with conda create -n vlm-r1 python=3.10 and conda activate vlm-r1, then run bash setup.sh.
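Once the environment is built, the LoRA option mentioned under How It Works is the lightest way to start training. Below is a minimal configuration sketch using Hugging Face peft; the checkpoint name, target modules, and hyperparameters are placeholders rather than the repository's defaults, and a recent transformers release with Qwen2.5-VL support is assumed.

```python
# Minimal LoRA setup sketch with Hugging Face peft; values are placeholders.
from transformers import AutoModelForImageTextToText
from peft import LoraConfig, get_peft_model

# Assumption: any VLM checkpoint supported by the framework, e.g. a Qwen2.5-VL model.
model = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

lora_cfg = LoraConfig(
    r=16,                     # adapter rank
    lora_alpha=32,            # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections (placeholder)
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights remain trainable
```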
Highlighted Details
Maintenance & Community
The project is actively updated, with recent additions including improved logging, custom reward functions, and support for InternVL. Community interaction is encouraged via GitHub issues and pull requests.
Licensing & Compatibility
The repository appears to be primarily distributed under a permissive license, though specific dependencies might carry other terms. Compatibility for commercial use is generally good for permissive licenses, but users should verify individual component licenses.
Limitations & Caveats
The setup requires significant data preparation and understanding of the GRPO and SFT training paradigms. While supporting multiple VLMs, adding new models requires following a specific guide. The project is research-oriented, and production readiness may require further validation.