videogamebench  by alexzhang13

VLM benchmark for video game understanding

created 3 months ago
288 stars

Top 92.1% on sourcepulse

GitHubView on GitHub
Project Summary

This project provides a benchmark environment for evaluating Vision-Language Models (VLMs) on popular video games, targeting researchers and developers in AI and gaming. It enables standardized assessment of VLM multi-modal understanding and reasoning capabilities within interactive game contexts, offering a novel approach to testing AI agents in complex, dynamic environments.

How It Works

VideoGameBench integrates with emulators like PyBoy (Game Boy) and JS-DOS (MS-DOS) to create interactive game environments. It leverages LiteLLM for seamless integration with various VLM APIs, allowing models to perceive game states via screenshots and output actions (key presses, mouse clicks). The system supports asynchronous agent-game interaction, enabling real-time gameplay evaluation.

Quick Start & Requirements

  • Installation:
    conda create -n videogamebench python=3.10
    conda activate videogamebench
    pip install -r requirements.txt
    pip install -e .
    playwright install
    
  • Prerequisites: Python 3.10, Conda, Playwright. Requires API keys for specified VLMs. Game ROMs must be legally owned and placed in the roms/ folder.
  • Running Games:
    • Game Boy: python main.py --game <game_name> --model <vlm_model> (e.g., pokemon_red, gpt-4o)
    • DOS: python main.py --game <game_name> --model <vlm_model> (e.g., doom2, gemini/gemini-2.5-pro-preview-03-25)
  • Documentation: Website + Blog, arXiv

Highlighted Details

  • Supports Game Boy, MS-DOS, and browser games.
  • Integrates with numerous VLMs via LiteLLM (e.g., GPT-4o, Gemini, Claude).
  • Includes an optional GUI (--enable-ui) for monitoring agent actions and thoughts.
  • Provides scripts for replicating paper experiments and generating video clips from gameplay frames.
  • Offers a "lite mode" (--lite) that pauses the game while the model processes.

Maintenance & Community

  • Active development indicated by recent arXiv publication.
  • Community support available via Discord.

Licensing & Compatibility

  • Codebase is MIT licensed, permitting personal and commercial use.
  • Important: Users must legally own the game ROMs and assets to use them with the benchmark.

Limitations & Caveats

The benchmark currently supports specific sets of Game Boy and MS-DOS games, with additional games requiring manual configuration and code edits. Some advanced features like custom HTML templates for JS-DOS are available but may require deeper integration.

Health Check
Last commit

2 months ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
82 stars in the last 90 days

Explore Similar Projects

Feedback? Help us improve.