VLM benchmark for video game understanding
Top 92.1% on sourcepulse
This project provides a benchmark environment for evaluating Vision-Language Models (VLMs) on popular video games, targeting researchers and developers in AI and gaming. It enables standardized assessment of VLM multi-modal understanding and reasoning capabilities within interactive game contexts, offering a novel approach to testing AI agents in complex, dynamic environments.
How It Works
VideoGameBench integrates with emulators like PyBoy (Game Boy) and JS-DOS (MS-DOS) to create interactive game environments. It leverages LiteLLM for seamless integration with various VLM APIs, allowing models to perceive game states via screenshots and output actions (key presses, mouse clicks). The system supports asynchronous agent-game interaction, enabling real-time gameplay evaluation.
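The perceive-reason-act cycle described above can be sketched as a simple control loop. None of the function names below come from the VideoGameBench codebase; they are stand-ins that only illustrate the flow of data from emulator to model and back:

```python
# Hypothetical sketch of the screenshot -> VLM -> action loop.
# Real implementations would grab frames from PyBoy/JS-DOS and call
# a VLM through LiteLLM; these stubs just show the control flow.

def capture_screenshot(frame: int) -> bytes:
    """Stand-in for an emulator screen grab."""
    return f"frame-{frame}".encode()

def query_vlm(screenshot: bytes) -> str:
    """Stand-in for a VLM call; a real agent would send the image."""
    return "PRESS_A"  # pretend the model chose this action

def apply_action(state: dict, action: str) -> dict:
    """Stand-in for feeding a key press back into the emulator."""
    state["actions"].append(action)
    return state

def run_episode(num_steps: int = 3) -> dict:
    state = {"actions": []}
    for frame in range(num_steps):
        shot = capture_screenshot(frame)     # perceive
        action = query_vlm(shot)             # reason
        state = apply_action(state, action)  # act
    return state
```

In the real benchmark this loop can also run asynchronously, so the game keeps advancing while the model is still deciding on its next action.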
Quick Start & Requirements
conda create -n videogamebench python=3.10
conda activate videogamebench
pip install -r requirements.txt
pip install -e .
playwright install
Place game ROMs in the roms/ folder.

Game Boy: python main.py --game <game_name> --model <vlm_model> (e.g., pokemon_red, gpt-4o)
MS-DOS: python main.py --game <game_name> --model <vlm_model> (e.g., doom2, gemini/gemini-2.5-pro-preview-03-25)
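The --model argument takes a LiteLLM model identifier. For VLMs, LiteLLM accepts OpenAI-style multimodal messages with the screenshot embedded as a base64 data URL; the helper below is illustrative, not code from this repository:

```python
import base64

# Illustrative only: packing a screenshot into an OpenAI-style
# multimodal message for a LiteLLM-compatible VLM.

def build_vlm_messages(png_bytes: bytes, prompt: str) -> list:
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{b64}"},
                },
            ],
        }
    ]

# A real call (needs an API key) would then look roughly like:
#   litellm.completion(model="gpt-4o", messages=build_vlm_messages(shot, prompt))
```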
Highlighted Details
A real-time UI (--enable-ui) for monitoring agent actions and thoughts.
A lite mode (--lite) that pauses the game while the model processes.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The benchmark currently supports specific sets of Game Boy and MS-DOS games, with additional games requiring manual configuration and code edits. Some advanced features like custom HTML templates for JS-DOS are available but may require deeper integration.
Last updated 2 months ago; the project is currently listed as inactive.