VAGEN by mll-lab-nu

Multi-turn VLM agent training framework

Created 10 months ago

368 stars

Top 76.7% on SourcePulse

Project Summary

Summary

VAGEN is a multi-turn reinforcement learning framework for training Vision-Language Model (VLM) agents, tackling visual state ambiguity and precision bottlenecks in sequential tasks. It targets researchers and engineers aiming to enhance VLM agent performance in interactive environments by explicitly supervising visual reasoning.

How It Works

VAGEN models multi-turn tasks as partially observable Markov decision processes (POMDPs) and trains agents with Visual Reasoning RL. The approach has two main parts: World Modeling Reasoning Prompts, which pair future state prediction with LLM-as-Judge rewards, and Bi-Level GAE, which provides fine-grained, turn-level rewards using separate discount factors for tokens within a turn and for steps across turns. Other key ideas are Selective Token Masking, which focuses optimization on critical tokens, and Cross-turn Credit Assignment. A rollout.py module translates between environment data and model tokens.
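To make the two-level idea concrete, here is a minimal sketch of a bi-level advantage computation in plain Python. It is not taken from the VAGEN code base: the function names, the argument layout, the default discount values, and the choice to seed the last token of each turn with that turn's advantage are all assumptions made for illustration. The only part grounded in the summary is the use of one discount factor across turns and another across tokens within a turn.

```python
from typing import List, Sequence, Tuple


def gae(rewards: Sequence[float], values: Sequence[float],
        gamma: float, lam: float) -> List[float]:
    """Standard GAE over one sequence, scanned right to left.

    values[k] is the value estimate before step k. The value after the
    final step is taken to be 0.0 (episode end), which is an assumption.
    """
    adv = [0.0] * len(rewards)
    next_value, next_adv = 0.0, 0.0
    for k in reversed(range(len(rewards))):
        delta = rewards[k] + gamma * next_value - values[k]
        adv[k] = delta + gamma * lam * next_adv
        next_value, next_adv = values[k], adv[k]
    return adv


def bi_level_gae(
    turn_rewards: Sequence[float],
    turn_values: Sequence[float],
    token_rewards: Sequence[Sequence[float]],
    token_values: Sequence[Sequence[float]],
    gamma_turn: float = 0.95,   # hypothetical default
    lam_turn: float = 1.0,      # hypothetical default
    gamma_token: float = 1.0,   # hypothetical default
    lam_token: float = 1.0,     # hypothetical default
) -> Tuple[List[float], List[List[float]]]:
    """Hypothetical two-level advantage estimate.

    Level 1: one advantage per turn, discounted with gamma_turn.
    Level 2: inside each turn, the last token is seeded with the turn
    advantage and earlier tokens are filled in with a token-level GAE
    recursion discounted with gamma_token.
    """
    turn_adv = gae(turn_rewards, turn_values, gamma_turn, lam_turn)

    token_adv: List[List[float]] = []
    for i, a_turn in enumerate(turn_adv):
        rewards, values = token_rewards[i], token_values[i]
        if not rewards:  # a turn with no generated tokens
            token_adv.append([])
            continue
        adv = [0.0] * len(rewards)
        adv[-1] = a_turn  # assumption: turn advantage enters at the last token
        for j in reversed(range(len(rewards) - 1)):
            delta = rewards[j] + gamma_token * values[j + 1] - values[j]
            adv[j] = delta + gamma_token * lam_token * adv[j + 1]
        token_adv.append(adv)
    return turn_adv, token_adv
```

With two turns, a reward only on the second turn, zero value estimates, and gamma_turn=0.5, the first turn receives half the credit of the second, and every token inside a turn inherits its turn's advantage when gamma_token and lam_token are both 1.0.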

Quick Start & Requirements

Requires Python 3.10 and Conda. To install, clone both verl and VAGEN, install verl (pip install -e .), then run bash scripts/install.sh inside the VAGEN directory, which installs the Frozenlake and Sokoban dependencies. An OpenAI API key is needed to set up the Visual Reasoning Reward. After running wandb login, usage examples can be launched from the provided shell scripts. The README also references guides for creating custom environments and services.
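The requirements above can be checked before launching a run with a small preflight script. This is a sketch, not part of VAGEN: the preflight function is hypothetical, and it assumes the API key is supplied through the OPENAI_API_KEY environment variable (the variable the official OpenAI client reads by default). Whether VAGEN reads the key from that variable is not stated in the summary.

```python
import os
import sys
from typing import List, Mapping, Sequence


def preflight(version_info: Sequence[int] = sys.version_info,
              environ: Mapping[str, str] = os.environ) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: List[str] = []

    # The summary states Python 3.10 as the required version.
    major_minor = tuple(version_info[:2])
    if major_minor != (3, 10):
        problems.append(
            f"Python 3.10 is required, found {major_minor[0]}.{major_minor[1]}"
        )

    # Assumption: the key for the Visual Reasoning Reward is provided
    # via the OPENAI_API_KEY environment variable.
    if not environ.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")

    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("WARNING:", problem)
```

Passing the version and environment as arguments keeps the check easy to test without touching the real interpreter or shell environment.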

Highlighted Details

Benchmarks models on five environments: sokoban, frozenlake, svg, navigation, primitive skill.
VAGEN-Full configuration shows robust, stable performance.
WorldModeling Reward consistently boosts performance via visual learning signals.
Bi-Level GAE offers notable gains but can be unstable with sparse rewards.

Maintenance & Community

Authored by a large team including Kangrui Wang, Li Fei-Fei, Yejin Choi, and Manling Li. Acknowledgements include RAGEN and verl. Paper URL: https://vagen-ai.github.io/. No direct community channels are listed.

Licensing & Compatibility

The specific open-source license is not stated in the README. Commercial use compatibility is not detailed.

Limitations & Caveats

Bi-Level GAE may be unstable in sparse-reward settings. The installation script only installs dependencies for the Frozenlake and Sokoban environments. An OpenAI API key is required for full Visual Reasoning Reward functionality.
