Official code for a research paper on fine-tuning VLMs as decision-making agents via RL
This repository provides the official codebase for fine-tuning large vision-language models (VLMs) as decision-making agents using reinforcement learning (RL). It targets researchers and practitioners aiming to adapt VLMs for interactive environments and complex task execution, offering a framework to bridge the gap between language understanding and embodied action.
How It Works
The project leverages a modified LLaVA architecture as the VLM backbone, integrating it with Proximal Policy Optimization (PPO) for RL fine-tuning. This approach allows the VLM to learn policies for decision-making within interactive environments like GymCards and ALFWorld. The core idea is to treat the VLM's output as actions, enabling it to learn from environmental feedback and improve task performance.
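Conceptually, one rollout step looks like the sketch below: the VLM generates text, the text is parsed into a discrete action, and the resulting reward is collected for a PPO update. This is a minimal illustration only; the environment, action set, and policy stub are hypothetical placeholders rather than the repository's actual interfaces.

```python
# Minimal sketch of the "VLM output as action" loop described above.
# The environment, action names, and policy stub are hypothetical stand-ins;
# the real codebase wires a LLaVA policy into PPO via its own training scripts.
import random

ACTIONS = ["hit", "stand"]  # e.g., a GymCards-style discrete action space

def vlm_policy(observation: str) -> str:
    """Stand-in for the VLM: returns an action string given an observation."""
    return random.choice(ACTIONS)  # the real model generates this text

def parse_action(text: str) -> int:
    """Map the VLM's generated text to a discrete environment action."""
    return ACTIONS.index(text) if text in ACTIONS else 0

def collect_rollout(env_step, num_steps: int = 8):
    """Collect (obs, action, reward) tuples that a PPO update would consume."""
    trajectory, obs = [], "initial observation"
    for _ in range(num_steps):
        action = parse_action(vlm_policy(obs))
        obs, reward, done = env_step(action)
        trajectory.append((obs, action, reward))
        if done:
            break
    return trajectory

def toy_env_step(action: int):
    """Toy environment step, purely for illustration."""
    return f"obs after action {action}", random.random(), random.random() < 0.2

if __name__ == "__main__":
    print(collect_rollout(toy_env_step))
```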
Quick Start & Requirements
The quick-start setup references config_zero2.yaml, a DeepSpeed ZeRO stage-2 configuration file used when launching training.
Highlighted Details
Maintenance & Community
The project has been archived due to maintenance challenges with package dependencies. The authors suggest using updated backbone models (e.g., Qwen, Llama3.2V) and customizing environments and RL configurations for future work.
Licensing & Compatibility
MIT License. Compatible with commercial use and closed-source linking.
Limitations & Caveats
The codebase has been archived, and its package dependencies may be outdated. Users may encounter tokenizer version incompatibilities that require manually adjusting token IDs. The authors note that running ALFWorld RL on multiple GPUs can lead to synchronization issues and recommend using a single GPU.
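As an illustration of the kind of sanity check that can surface the tokenizer issue, the snippet below compares special-token IDs against expected values; the model path and the expected IDs are placeholders, not values taken from the repository.

```python
# Hypothetical sanity check for the tokenizer/token-ID caveat above.
# The model path and expected IDs are placeholders; substitute the backbone
# you use and the IDs the codebase actually hard-codes.
from transformers import AutoTokenizer

EXPECTED_IDS = {"<image>": 32000}  # placeholder expectation

tokenizer = AutoTokenizer.from_pretrained("path/to/vlm-backbone")
for token, expected in EXPECTED_IDS.items():
    actual = tokenizer.convert_tokens_to_ids(token)
    if actual != expected:
        print(f"{token}: expected id {expected}, got {actual}; "
              "update the hard-coded token IDs to match your tokenizer version.")
```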