Official code for a research paper on fine-tuning VLMs as decision-making agents via RL
This repository provides the official codebase for fine-tuning large vision-language models (VLMs) as decision-making agents using reinforcement learning (RL). It targets researchers and practitioners aiming to adapt VLMs for interactive environments and complex task execution, offering a framework to bridge the gap between language understanding and embodied action.
How It Works
The project leverages a modified LLaVA architecture as the VLM backbone, integrating it with Proximal Policy Optimization (PPO) for RL fine-tuning. This approach allows the VLM to learn policies for decision-making within interactive environments like GymCards and ALFWorld. The core idea is to treat the VLM's output as actions, enabling it to learn from environmental feedback and improve task performance.
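Conceptually, one rollout step looks like the sketch below: the VLM generates text, the text is parsed into a discrete action, and the resulting reward is collected for a PPO update. This is a minimal illustration only; the environment, action set, and policy stub are hypothetical placeholders rather than the repository's actual interfaces.

```python
# Minimal sketch of the "VLM output as action" loop described above.
# The environment, action names, and policy stub are hypothetical stand-ins;
# the real codebase wires a LLaVA policy into PPO via its own training scripts.
import random

ACTIONS = ["hit", "stand"]  # e.g., a GymCards-style discrete action space

def vlm_policy(observation: str) -> str:
    """Stand-in for the VLM: returns an action string given an observation."""
    return random.choice(ACTIONS)  # the real model generates this text

def parse_action(text: str) -> int:
    """Map the VLM's generated text to a discrete environment action."""
    return ACTIONS.index(text) if text in ACTIONS else 0

def collect_rollout(env_step, num_steps: int = 8):
    """Collect (obs, action, reward) tuples that a PPO update would consume."""
    trajectory, obs = [], "initial observation"
    for _ in range(num_steps):
        action = parse_action(vlm_policy(obs))
        obs, reward, done = env_step(action)
        trajectory.append((obs, action, reward))
        if done:
            break
    return trajectory

def toy_env_step(action: int):
    """Toy environment step, purely for illustration."""
    return f"obs after action {action}", random.random(), random.random() < 0.2

if __name__ == "__main__":
    print(collect_rollout(toy_env_step))
```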
Quick Start & Requirements
The quick-start setup references config_zero2.yaml, a DeepSpeed ZeRO stage-2 configuration file used when launching training.
Highlighted Details
Maintenance & Community
The project has been archived due to maintenance challenges with package dependencies. The authors suggest using updated backbone models (e.g., Qwen, Llama3.2V) and customizing environments and RL configurations for future work.
Licensing & Compatibility
MIT License. Compatible with commercial use and closed-source linking.
Limitations & Caveats
The codebase has been archived, and its package dependencies may be outdated. Users may encounter tokenizer version incompatibilities that require manually adjusting token IDs. The authors note that running ALFWorld RL on multiple GPUs can lead to synchronization issues and recommend using a single GPU.
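As an illustration of the kind of sanity check that can surface the tokenizer issue, the snippet below compares special-token IDs against expected values; the model path and the expected IDs are placeholders, not values taken from the repository.

```python
# Hypothetical sanity check for the tokenizer/token-ID caveat above.
# The model path and expected IDs are placeholders; substitute the backbone
# you use and the IDs the codebase actually hard-codes.
from transformers import AutoTokenizer

EXPECTED_IDS = {"<image>": 32000}  # placeholder expectation

tokenizer = AutoTokenizer.from_pretrained("path/to/vlm-backbone")
for token, expected in EXPECTED_IDS.items():
    actual = tokenizer.convert_tokens_to_ids(token)
    if actual != expected:
        print(f"{token}: expected id {expected}, got {actual}; "
              "update the hard-coded token IDs to match your tokenizer version.")
```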