Visual-RFT by Liuziyu77

Visual reinforcement fine-tuning research paper

Created 5 months ago · 2,114 stars · Top 21.7% on sourcepulse

Project Summary

Visual-RFT introduces a novel reinforcement fine-tuning approach for Large Vision-Language Models (LVLMs), extending DeepSeek-R1's RL strategy to multimodal tasks. It targets researchers and developers who want to improve LVLM performance on diverse visual perception tasks such as object detection and fine-grained classification, offering an efficient, high-quality fine-tuning method that works with limited data.

How It Works

Visual-RFT employs a GRPO-based reinforcement fine-tuning framework. The core innovation is its set of rule-based, verifiable reward functions tailored to specific visual tasks (e.g., an IoU-based reward for detection). These rewards cheaply score model-generated responses, and the scores drive updates to the policy model. A KL-divergence penalty against a reference model limits policy drift, stabilizing training and enabling RL strategies to adapt effectively to the multimodal domain.
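To make the mechanism concrete, here is a minimal, self-contained sketch, not the repository's implementation and with all names illustrative, of an IoU-based verifiable reward plus the group-relative advantage normalization GRPO uses in place of a learned critic:

    import numpy as np

    def iou(box_a, box_b):
        """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
        x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
        x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter + 1e-8)

    def detection_reward(pred_box, gt_box, follows_format):
        """Rule-based, verifiable reward: IoU against the ground truth,
        plus a small bonus when the response uses the required format."""
        return iou(pred_box, gt_box) + (0.1 if follows_format else 0.0)

    def grpo_advantages(rewards):
        """GRPO: normalize each reward against the group of responses
        sampled for the same prompt -- no learned value function."""
        r = np.asarray(rewards, dtype=np.float64)
        return (r - r.mean()) / (r.std() + 1e-8)

    # Score a group of 4 sampled responses to one detection query.
    gt = [10, 10, 50, 50]
    samples = [([12, 11, 48, 52], True), ([0, 0, 30, 30], True),
               ([10, 10, 50, 50], False), ([40, 40, 90, 90], True)]
    rewards = [detection_reward(box, gt, ok) for box, ok in samples]
    print(grpo_advantages(rewards))  # higher-IoU responses get positive advantage

Because the reward is computed by a fixed rule rather than a learned reward model, it is cheap to evaluate and less susceptible to reward-model drift, which is part of what makes small-data fine-tuning practical.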

Quick Start & Requirements

  • Install: Clone the repository and set up a Conda environment.
    git clone https://github.com/Liuziyu77/Visual-RFT.git
    cd Visual-RFT
    conda create -n Visual-RFT python=3.10
    conda activate Visual-RFT
    bash setup.sh

  • Prerequisites: Python 3.10, PyTorch, DeepSpeed, Hugging Face Datasets, and Flash Attention 2 are recommended. Training requires significant GPU resources (e.g., 8 GPUs); COCO evaluation needs at least two. A quick checkpoint-loading sketch follows this list.
  • Resources: Training on small datasets (hundreds of samples) can be completed in ~200 steps.
  • Links: Paper, Datasets, Demo
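As a quick environment check after installation, the base checkpoint can be loaded with Flash Attention 2 enabled. This is a minimal sketch assuming the public Hugging Face weights Qwen/Qwen2-VL-2B-Instruct (the 2B checkpoint the project builds on) and a recent transformers release:

    import torch
    from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

    MODEL_ID = "Qwen/Qwen2-VL-2B-Instruct"  # assumed public base checkpoint

    # Load in bf16 with Flash Attention 2; requires flash-attn to be installed
    # (e.g., via the project's setup.sh) and a GPU that supports bf16.
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
        device_map="auto",
    )
    processor = AutoProcessor.from_pretrained(MODEL_ID)
    print(model.config.model_type)  # "qwen2_vl" if everything resolved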

Highlighted Details

  • Introduces Visual Reinforcement Fine-tuning (Visual-RFT) for LVLMs.
  • Features verifiable reward functions for efficient, high-quality fine-tuning.
  • Extends RL to tasks like Open Vocabulary Detection, Few-shot Detection, Reasoning Grounding, and Fine-grained Image Classification.
  • Provides open-sourced training code, data, and evaluation scripts.

Maintenance & Community

The project is actively maintained with recent releases of code, paper, and datasets in March 2025. Further community engagement details (e.g., Discord/Slack) are not explicitly provided in the README.

Licensing & Compatibility

  • Code License: Apache 2.0
  • Data License: CC BY-NC 4.0 (Attribution-NonCommercial 4.0 International)
  • Usage: Intended and licensed for research use only. Commercial use is restricted by the non-commercial clause and OpenAI's terms of use.

Limitations & Caveats

The data license restricts commercial use. The project depends on specific model checkpoints (Qwen2-VL-2B) and can require significant GPU memory; the README provides OOM mitigation strategies and explicitly requires at least two GPUs for COCO, LVIS, and classification evaluation.
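The README's own OOM mitigations are not reproduced here; the sketch below shows generic mitigations in the same stack (gradient checkpointing, small per-device batches with accumulation, and a DeepSpeed ZeRO-3 offload config passed to transformers' TrainingArguments). All values and paths are illustrative assumptions, not the repository's settings:

    from transformers import TrainingArguments

    # Illustrative DeepSpeed ZeRO-3 config with CPU offload. The repository
    # ships its own configs; treat every value here as a placeholder.
    ds_config = {
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {"device": "cpu"},
            "offload_param": {"device": "cpu"},
        },
        "bf16": {"enabled": True},
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
    }

    args = TrainingArguments(
        output_dir="./checkpoints",     # hypothetical path
        per_device_train_batch_size=1,  # shrink the batch...
        gradient_accumulation_steps=8,  # ...and accumulate to compensate
        gradient_checkpointing=True,    # trade compute for activation memory
        bf16=True,
        deepspeed=ds_config,            # dict or path to a JSON config
    )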

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 7
  • Star History: 513 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations

Open-source framework for training large multimodal models

Top 0.1% · 4k stars · created 2 years ago · updated 11 months ago