Visual-RFT by Liuziyu77

Visual reinforcement fine-tuning research paper

Created 5 months ago · 2,114 stars · Top 21.7% on sourcepulse

Project Summary

Visual-RFT introduces a novel reinforcement fine-tuning approach for Large Vision-Language Models (LVLMs), extending DeepSeek-R1's RL strategy to multimodal tasks. It targets researchers and developers who want to improve LVLM performance on diverse visual perception tasks such as object detection and fine-grained classification, offering an efficient, high-quality fine-tuning method that works with limited data.

How It Works

Visual-RFT employs a GRPO-based reinforcement fine-tuning framework. The core innovation is its set of rule-based, verifiable reward functions tailored to specific visual tasks (e.g., an IoU-based reward for detection). These rewards cheaply score model-generated responses, and the scores drive updates to the policy model. A KL-divergence penalty against a reference model limits policy drift, stabilizing training and enabling RL strategies to adapt effectively to the multimodal domain.
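To make the mechanism concrete, here is a minimal, self-contained sketch, not the repository's implementation and with all names illustrative, of an IoU-based verifiable reward plus the group-relative advantage normalization GRPO uses in place of a learned critic:

    import numpy as np

    def iou(box_a, box_b):
        """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
        x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
        x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter + 1e-8)

    def detection_reward(pred_box, gt_box, follows_format):
        """Rule-based, verifiable reward: IoU against the ground truth,
        plus a small bonus when the response uses the required format."""
        return iou(pred_box, gt_box) + (0.1 if follows_format else 0.0)

    def grpo_advantages(rewards):
        """GRPO: normalize each reward against the group of responses
        sampled for the same prompt -- no learned value function."""
        r = np.asarray(rewards, dtype=np.float64)
        return (r - r.mean()) / (r.std() + 1e-8)

    # Score a group of 4 sampled responses to one detection query.
    gt = [10, 10, 50, 50]
    samples = [([12, 11, 48, 52], True), ([0, 0, 30, 30], True),
               ([10, 10, 50, 50], False), ([40, 40, 90, 90], True)]
    rewards = [detection_reward(box, gt, ok) for box, ok in samples]
    print(grpo_advantages(rewards))  # higher-IoU responses get positive advantage

Because the reward is computed by a fixed rule rather than a learned reward model, it is cheap to evaluate and less susceptible to reward-model drift, which is part of what makes small-data fine-tuning practical.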

Quick Start & Requirements

  • Install: Clone the repository and set up a Conda environment.
    git clone https://github.com/Liuziyu77/Visual-RFT.git
    cd Visual-RFT
    conda create -n Visual-RFT python=3.10
    conda activate Visual-RFT
    bash setup.sh

  • Prerequisites: Python 3.10, PyTorch, DeepSpeed, Hugging Face Datasets, and Flash Attention 2 are recommended. Training requires significant GPU resources (e.g., 8 GPUs); COCO evaluation needs at least two. A quick checkpoint-loading sketch follows this list.
  • Resources: Training on small datasets (hundreds of samples) can be completed in ~200 steps.
  • Links: Paper, Datasets, Demo
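As a quick environment check after installation, the base checkpoint can be loaded with Flash Attention 2 enabled. This is a minimal sketch assuming the public Hugging Face weights Qwen/Qwen2-VL-2B-Instruct (the 2B checkpoint the project builds on) and a recent transformers release:

    import torch
    from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

    MODEL_ID = "Qwen/Qwen2-VL-2B-Instruct"  # assumed public base checkpoint

    # Load in bf16 with Flash Attention 2; requires flash-attn to be installed
    # (e.g., via the project's setup.sh) and a GPU that supports bf16.
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
        device_map="auto",
    )
    processor = AutoProcessor.from_pretrained(MODEL_ID)
    print(model.config.model_type)  # "qwen2_vl" if everything resolved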

Highlighted Details

  • Introduces Visual Reinforcement Fine-tuning (Visual-RFT) for LVLMs.
  • Features verifiable reward functions for efficient, high-quality fine-tuning.
  • Extends RL to tasks like Open Vocabulary Detection, Few-shot Detection, Reasoning Grounding, and Fine-grained Image Classification.
  • Provides open-sourced training code, data, and evaluation scripts.

Maintenance & Community

The project is actively maintained with recent releases of code, paper, and datasets in March 2025. Further community engagement details (e.g., Discord/Slack) are not explicitly provided in the README.

Licensing & Compatibility

  • Code License: Apache 2.0
  • Data License: CC BY-NC 4.0 (Attribution-NonCommercial 4.0 International)
  • Usage: Intended and licensed for research use only. Commercial use is restricted by the non-commercial clause and OpenAI's terms of use.

Limitations & Caveats

The data license restricts commercial use. The project depends on specific model checkpoints (Qwen2-VL-2B) and can require significant GPU memory; the README provides OOM mitigation strategies and explicitly requires at least two GPUs for COCO, LVIS, and classification evaluation.
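The README's own OOM mitigations are not reproduced here; the sketch below shows generic mitigations in the same stack (gradient checkpointing, small per-device batches with accumulation, and a DeepSpeed ZeRO-3 offload config passed to transformers' TrainingArguments). All values and paths are illustrative assumptions, not the repository's settings:

    from transformers import TrainingArguments

    # Illustrative DeepSpeed ZeRO-3 config with CPU offload. The repository
    # ships its own configs; treat every value here as a placeholder.
    ds_config = {
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {"device": "cpu"},
            "offload_param": {"device": "cpu"},
        },
        "bf16": {"enabled": True},
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
    }

    args = TrainingArguments(
        output_dir="./checkpoints",     # hypothetical path
        per_device_train_batch_size=1,  # shrink the batch...
        gradient_accumulation_steps=8,  # ...and accumulate to compensate
        gradient_checkpointing=True,    # trade compute for activation memory
        bf16=True,
        deepspeed=ds_config,            # dict or path to a JSON config
    )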

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 7
  • Star History: 513 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations

Open-source framework for training large multimodal models

Top 0.1% · 4k stars · created 2 years ago · updated 11 months ago