VAGEN by mll-lab-nu

Multi-turn VLM agent training framework

Created 8 months ago
288 stars

Top 91.1% on SourcePulse

Summary

VAGEN is a multi-turn reinforcement learning framework for training Vision-Language Model (VLM) agents, tackling visual state ambiguity and precision bottlenecks in sequential tasks. It targets researchers and engineers aiming to enhance VLM agent performance in interactive environments by explicitly supervising visual reasoning.

How It Works

VAGEN models multi-turn tasks as partially observable Markov decision processes (POMDPs) and trains agents with Visual Reasoning RL. World Modeling Reasoning Prompts have the model predict future states, and an LLM-as-Judge turns that reasoning into a reward. Bi-Level GAE assigns fine-grained, turn-level rewards using two distinct discount factors: one within a turn (across tokens) and one across turns. Further techniques are Selective Token Masking, which focuses optimization on critical tokens, and Cross-turn Credit Assignment. A rollout.py module translates between environment data and model tokens.
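The two discount factors in Bi-Level GAE can be illustrated with a small sketch. This is not VAGEN's implementation: the function names, the argument layout, and the choice to inject each turn's advantage as a reward on that turn's final token are all assumptions made for illustration. It only shows the general shape of running generalized advantage estimation once across turns and once across the tokens inside each turn.

```python
from typing import List, Sequence, Tuple


def gae(rewards: Sequence[float], values: Sequence[float],
        gamma: float, lam: float) -> List[float]:
    """Standard GAE over one sequence, with a zero value after the last step."""
    advantages = [0.0] * len(rewards)
    next_value, running = 0.0, 0.0
    for i in reversed(range(len(rewards))):
        delta = rewards[i] + gamma * next_value - values[i]  # TD residual
        running = delta + gamma * lam * running
        advantages[i] = running
        next_value = values[i]
    return advantages


def bi_level_gae(turn_rewards: Sequence[float],
                 turn_values: Sequence[float],
                 token_values: Sequence[Sequence[float]],
                 gamma_turn: float, gamma_token: float,
                 lam: float) -> Tuple[List[float], List[List[float]]]:
    """Hypothetical two-level advantage estimate.

    Level 1: GAE across turns with the inter-turn discount gamma_turn.
    Level 2: GAE across the tokens of each turn with the intra-turn
    discount gamma_token. Assumption: the turn-level advantage is the only
    reward inside a turn and is placed on the turn's final token.
    Every turn must contain at least one token.
    """
    turn_adv = gae(turn_rewards, turn_values, gamma_turn, lam)
    token_adv = []
    for adv, values in zip(turn_adv, token_values):
        rewards = [0.0] * len(values)
        rewards[-1] = adv  # assumed placement of the turn-level signal
        token_adv.append(gae(rewards, values, gamma_token, lam))
    return turn_adv, token_adv
```

With `gamma_turn = 1.0` the final reward reaches every earlier turn undiminished, while a smaller `gamma_token` makes tokens far from the end of a turn receive a weaker signal. Keeping the two discounts separate lets each level be tuned on its own.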

Quick Start & Requirements

Requires Python 3.10 and Conda. To install, clone both verl and VAGEN, install verl from its checkout (pip install -e .), then run bash scripts/install.sh from the VAGEN checkout; the script installs the Frozenlake and Sokoban dependencies. Setting up the Visual Reasoning Reward requires an OpenAI API key. After wandb login, the provided shell scripts serve as usage examples. The README also references guides for adding custom environments and services.
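The requirements above can be checked before launching a run. The sketch below is a hypothetical helper, not part of VAGEN. The function name `preflight` is invented for illustration, and the variable name `OPENAI_API_KEY` is an assumption based on the OpenAI SDK's usual convention, since the summary does not say how VAGEN reads the key.

```python
import shutil
import sys
from typing import List, Mapping, Optional, Tuple


def preflight(env: Mapping[str, str],
              version: Tuple[int, int] = tuple(sys.version_info[:2]),
              has_conda: Optional[bool] = None) -> List[str]:
    """Return a list of unmet requirements (an empty list means all clear)."""
    problems = []
    # The summary states Python 3.10 is required.
    if tuple(version) != (3, 10):
        problems.append(
            f"Python 3.10 expected, found {version[0]}.{version[1]}")
    # Conda is required; look for it on PATH unless the caller overrides.
    if has_conda is None:
        has_conda = shutil.which("conda") is not None
    if not has_conda:
        problems.append("conda not found on PATH")
    # Assumed variable name; needed only for the Visual Reasoning Reward.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems
```

Calling `preflight(os.environ)` after `import os` reports problems for the current shell. The `version` and `has_conda` parameters exist only so the check can be exercised without touching the real environment.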

Highlighted Details

  • Benchmarks models on five environments: sokoban, frozenlake, svg, navigation, primitive skill.
  • VAGEN-Full configuration shows robust, stable performance.
  • WorldModeling Reward consistently boosts performance via visual learning signals.
  • Bi-Level GAE offers notable gains but can be unstable with sparse rewards.

Maintenance & Community

Authored by a large team including Kangrui Wang, Li Fei-Fei, Yejin Choi, and Manling Li. Acknowledgements include RAGEN and verl. Paper URL: https://vagen-ai.github.io/. No direct community channels are listed.

Licensing & Compatibility

The specific open-source license is not stated in the README. Commercial use compatibility is not detailed.

Limitations & Caveats

Bi-Level GAE may be unstable in sparse-reward settings. The install script covers only the Frozenlake and Sokoban dependencies. An OpenAI API key is required for full Visual Reasoning Reward functionality.

Health Check

Last Commit: 1 week ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 2

Star History

70 stars in the last 30 days