RL4VLM by RL4VLM

Research paper for fine-tuning VLMs as decision-making agents via RL

created 1 year ago
376 stars

Top 76.7% on sourcepulse

Project Summary

This repository provides the official codebase for fine-tuning large vision-language models (VLMs) as decision-making agents using reinforcement learning (RL). It targets researchers and practitioners aiming to adapt VLMs for interactive environments and complex task execution, offering a framework to bridge the gap between language understanding and embodied action.

How It Works

The project leverages a modified LLaVA architecture as the VLM backbone, integrating it with Proximal Policy Optimization (PPO) for RL fine-tuning. This approach allows the VLM to learn policies for decision-making within interactive environments like GymCards and ALFWorld. The core idea is to treat the VLM's output as actions, enabling it to learn from environmental feedback and improve task performance.
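The idea of treating the VLM's text output as an action that is reinforced by environment reward can be illustrated with a toy, dependency-free sketch. This is not the paper's code: the action strings, the bandit-style environment, and the REINFORCE-style update are all illustrative stand-ins for the real LLaVA backbone and PPO loop.

```python
import math
import random

random.seed(0)

# Hypothetical text actions the "VLM" could emit in a card game.
ACTIONS = ['{"action": "hit"}', '{"action": "stand"}']

def softmax(prefs):
    """Convert action preferences (logits) into a probability distribution."""
    exps = [math.exp(p) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs):
    """Sample an action index from the policy distribution."""
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def toy_reward(action_idx):
    # Stand-in environment: pretend "stand" (index 1) is the better move.
    return 1.0 if action_idx == 1 else 0.0

prefs = [0.0, 0.0]  # logits over the two text actions
lr = 0.5
for _ in range(500):
    probs = softmax(prefs)
    a = sample_action(probs)
    r = toy_reward(a)
    # REINFORCE-style update: raise the log-probability of rewarded actions.
    for i in range(len(prefs)):
        grad = (1.0 if i == a else 0.0) - probs[i]
        prefs[i] += lr * r * grad

print(ACTIONS[max(range(len(prefs)), key=lambda i: prefs[i])])
```

After training, the policy's probability mass shifts toward the rewarded action, which is the same feedback loop RL4VLM runs at scale with PPO over a LLaVA policy.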

Quick Start & Requirements

  • Installation: Requires separate conda environments for GymCards and ALFWorld, because their package dependencies conflict.
  • Prerequisites: LLaVA 1.6 Mistral 7B checkpoint is recommended as a starting point. The codebase relies on specific tokenizer versions, and users may need to manually verify token IDs for actions.
  • Resources: Training involves a two-step process: Supervised Fine-Tuning (SFT) followed by RL fine-tuning. The number of GPUs for RL can be configured via config_zero2.yaml.
  • Links: Paper, Project Page, Wandb Report, Data Release
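The two-environment installation might look like the following. This is a hypothetical sketch: the environment names, Python versions, and requirements filenames are assumptions, not taken from the repository.

```shell
# Hypothetical setup sketch; check the repo README for exact names and files.
conda create -n rl4vlm-gymcards python=3.10 -y
conda activate rl4vlm-gymcards
pip install -r requirements.txt          # GymCards dependencies (assumed filename)

conda create -n rl4vlm-alfworld python=3.9 -y
conda activate rl4vlm-alfworld
pip install -r requirements_alfworld.txt # ALFWorld dependencies (assumed filename)
```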

Highlighted Details

  • Fine-tunes VLMs as decision-making agents using RL.
  • Supports interactive environments like GymCards and ALFWorld.
  • Provides scripts for both SFT and RL fine-tuning stages.
  • Offers a template for using the GymCards environment in pure text.

Maintenance & Community

The project has been archived due to maintenance challenges with package dependencies. The authors suggest using updated backbone models (e.g., Qwen, Llama3.2V) and customizing environments and RL configurations for future work.

Licensing & Compatibility

MIT License. Compatible with commercial use and closed-source linking.

Limitations & Caveats

The codebase has been archived, and package dependencies may be outdated. Users might encounter issues with tokenizer version compatibility, requiring manual adjustments for token IDs. The authors note that using multiple GPUs for ALFWorld RL can lead to synchronization issues and recommend using a single GPU.
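The single-GPU recommendation for ALFWorld would be applied in the accelerate config. A hypothetical fragment of `config_zero2.yaml` is sketched below, assuming the file follows HuggingFace Accelerate's DeepSpeed config format (field names and values are assumptions, not copied from the repo):

```yaml
# Hypothetical fragment of config_zero2.yaml (HuggingFace Accelerate format).
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 2
num_processes: 1   # single GPU, per the authors' ALFWorld recommendation
mixed_precision: bf16
```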

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 29 stars in the last 90 days
