videogamebench by alexzhang13

VLM benchmark for video game understanding

Created 9 months ago

320 stars

Top 84.9% on SourcePulse

View on GitHub

1 Expert Loves This Project

John Yang

Coauthor of SWE-bench, SWE-agent

Project Summary

This project provides a benchmark environment for evaluating Vision-Language Models (VLMs) on popular video games, targeting researchers and developers in AI and gaming. It enables standardized assessment of VLM multi-modal understanding and reasoning capabilities within interactive game contexts, offering a novel approach to testing AI agents in complex, dynamic environments.

How It Works

VideoGameBench integrates with emulators like PyBoy (Game Boy) and JS-DOS (MS-DOS) to create interactive game environments. It leverages LiteLLM for seamless integration with various VLM APIs, allowing models to perceive game states via screenshots and output actions (key presses, mouse clicks). The system supports asynchronous agent-game interaction, enabling real-time gameplay evaluation.

Quick Start & Requirements

Installation:

conda create -n videogamebench python=3.10
conda activate videogamebench
pip install -r requirements.txt
pip install -e .
playwright install

Prerequisites: Python 3.10, Conda, Playwright. Requires API keys for specified VLMs. Game ROMs must be legally owned and placed in the roms/ folder.
Running Games:
- Game Boy: python main.py --game <game_name> --model <vlm_model> (e.g., pokemon_red, gpt-4o)
- DOS: python main.py --game <game_name> --model <vlm_model> (e.g., doom2, gemini/gemini-2.5-pro-preview-03-25)
Documentation: Website + Blog, arXiv

Highlighted Details

Supports Game Boy, MS-DOS, and browser games.
Integrates with numerous VLMs via LiteLLM (e.g., GPT-4o, Gemini, Claude).
Includes an optional GUI (--enable-ui) for monitoring agent actions and thoughts.
Provides scripts for replicating paper experiments and generating video clips from gameplay frames.
Offers a "lite mode" (--lite) that pauses the game while the model processes.

Maintenance & Community

Active development indicated by recent arXiv publication.
Community support available via Discord.

Licensing & Compatibility

Codebase is MIT licensed, permitting personal and commercial use.
Important: Users must legally own the game ROMs and assets to use them with the benchmark.

Limitations & Caveats

The benchmark currently supports specific sets of Game Boy and MS-DOS games, with additional games requiring manual configuration and code edits. Some advanced features like custom HTML templates for JS-DOS are available but may require deeper integration.

Health Check

Last Commit

7 months ago

Responsiveness

1 week

Pull Requests (30d)

Issues (30d)

Star History

8 stars in the last 30 days