mll-lab-nu: Multi-turn VLM agent training framework
Top 91.1% on SourcePulse
Summary
VAGEN is a multi-turn reinforcement learning framework for training Vision-Language Model (VLM) agents, tackling visual state ambiguity and precision bottlenecks in sequential tasks. It targets researchers and engineers aiming to enhance VLM agent performance in interactive environments by explicitly supervising visual reasoning.
How It Works
VAGEN models multi-turn tasks as partially observable Markov decision processes (POMDPs) and trains agents with Visual Reasoning RL. The approach features World Modeling Reasoning Prompts (future state prediction, LLM-as-Judge rewards) and Bi-Level GAE, which provides fine-grained, turn-level rewards using two distinct discount factors: one across tokens within a turn and one across turns. Key innovations include Selective Token Masking, which focuses optimization on critical tokens, and Cross-turn Credit Assignment. A rollout.py module translates between environment data and model tokens.
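The two-level advantage computation can be illustrated with a minimal sketch. This is one plausible reading of the Bi-Level GAE description above, not VAGEN's actual code: the function names (gae, bi_level_gae), the argument layout, and the default discount values are all assumptions made for illustration. Advantages are first estimated across turns, then each turn's advantage seeds the last token of that turn and is propagated backward over the turn's tokens.

```python
from typing import List, Sequence, Tuple


def gae(rewards: Sequence[float], values: Sequence[float],
        gamma: float, lam: float, bootstrap: float = 0.0) -> List[float]:
    """Standard GAE over one sequence; `bootstrap` is the value after the last step."""
    advantages = [0.0] * len(rewards)
    running, next_value = 0.0, bootstrap
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def bi_level_gae(turn_rewards: Sequence[float],
                 turn_values: Sequence[float],
                 token_values: Sequence[Sequence[float]],
                 gamma_turn: float = 0.95, lam_turn: float = 1.0,    # assumed defaults
                 gamma_token: float = 1.0, lam_token: float = 1.0,   # assumed defaults
                 ) -> Tuple[List[float], List[List[float]]]:
    """Hypothetical two-level advantage estimate (illustrative, not VAGEN's API)."""
    # Level 1: one advantage per turn, discounted across turns.
    turn_adv = gae(turn_rewards, turn_values, gamma_turn, lam_turn)

    # Level 2: within each turn there are no per-token rewards, so the
    # turn-level advantage seeds the last token and flows backward.
    token_adv: List[List[float]] = []
    for adv_k, values_k in zip(turn_adv, token_values):
        if not values_k:              # a turn with no generated tokens
            token_adv.append([])
            continue
        out = [0.0] * len(values_k)
        out[-1] = adv_k
        for i in reversed(range(len(values_k) - 1)):
            delta = gamma_token * values_k[i + 1] - values_k[i]
            out[i] = delta + gamma_token * lam_token * out[i + 1]
        token_adv.append(out)
    return turn_adv, token_adv
```

Keeping the two discount factors separate lets the inter-turn factor control how far a reward reaches back across turns, while the intra-turn factor controls how credit is spread over the tokens of a single response.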
Quick Start & Requirements
Requires Python 3.10 and Conda. To install, clone both verl and VAGEN, install verl with pip install -e ., then run bash scripts/install.sh inside the VAGEN directory (this installs the Frozenlake and Sokoban dependencies). An OpenAI API key is needed to set up the Visual Reasoning Reward. Usage examples are provided as shell scripts, to be run after wandb login. The README also references guides for creating custom environments and services.
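Before running the install script, the stated prerequisites can be checked with a small preflight script. This is a hypothetical helper, not part of VAGEN; it assumes the API key is supplied through the OPENAI_API_KEY environment variable, which is the convention used by the OpenAI client but is not confirmed by the summary above.

```python
import os
import shutil
import sys
from typing import List, Mapping, Optional, Tuple


def preflight(env: Optional[Mapping[str, str]] = None,
              version: Optional[Tuple[int, int]] = None,
              conda_path: Optional[str] = None) -> List[str]:
    """Return a list of unmet prerequisites (empty list means all checks passed)."""
    env = os.environ if env is None else env
    version = tuple(sys.version_info[:2]) if version is None else tuple(version)
    conda_path = shutil.which("conda") if conda_path is None else conda_path

    problems = []
    if version != (3, 10):
        problems.append(f"Python 3.10 required, found {version[0]}.{version[1]}")
    if not conda_path:
        problems.append("conda not found on PATH")
    # Assumption: the key is read from OPENAI_API_KEY.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set (needed for Visual Reasoning Reward)")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("missing:", problem)
```

The arguments exist only so the checks can be exercised without touching the real environment; calling preflight() with no arguments inspects the current interpreter, PATH, and environment variables.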
Highlighted Details
Maintenance & Community
Authored by a large team including Kangrui Wang, Li Fei-Fei, Yejin Choi, and Manling Li. Acknowledgements include RAGEN and verl. Paper URL: https://vagen-ai.github.io/. No direct community channels are listed.
Licensing & Compatibility
The specific open-source license is not stated in the README. Commercial use compatibility is not detailed.
Limitations & Caveats
Bi-Level GAE may be unstable in sparse-reward settings. The installation script only installs dependencies for the Frozenlake and Sokoban environments. An OpenAI API key is required for full Visual Reasoning Reward functionality.