llm-colosseum by OpenGenerativeAI

LLM benchmark using Street Fighter III to evaluate real-time decision-making

Created 1 year ago
1,447 stars

Top 28.2% on SourcePulse

Project Summary

This project provides a novel benchmark for evaluating Large Language Models (LLMs) by pitting them against each other in real-time gameplay of Street Fighter III. It targets AI researchers and developers seeking to assess LLM capabilities in speed, strategic thinking, adaptability, and resilience within a dynamic, interactive environment.

How It Works

LLMs act as AI players, controlled via API calls. On each turn the system sends the LLM either a text description of the game state (TextRobot) or a screenshot (VisionRobot), and the LLM replies with a list of moves for its character to execute. This lets the LLM draw on its contextual understanding when choosing actions, unlike traditional reinforcement learning agents, whose behavior is shaped by a reward function.
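The TextRobot flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the move vocabulary, the prompt wording, and the names describe_state, parse_moves, and next_moves are all hypothetical, and the LLM is represented by any callable that maps a prompt string to a reply string.

```python
import re
from typing import Callable, List

# Hypothetical move vocabulary; the project's real move names may differ.
VALID_MOVES = {
    "move closer", "move away", "jump", "crouch",
    "low punch", "high kick", "fireball",
}


def describe_state(own_hp: int, opp_hp: int, distance: int) -> str:
    """Render a plain-text observation for the model (illustrative format)."""
    return (
        f"Your health: {own_hp}. Opponent health: {opp_hp}. "
        f"Distance to opponent: {distance} pixels. "
        "Reply with a bullet list of moves chosen from: "
        + ", ".join(sorted(VALID_MOVES)) + "."
    )


def parse_moves(reply: str) -> List[str]:
    """Keep only lines that, once bullet markers are stripped, name a valid move."""
    moves = []
    for line in reply.splitlines():
        # Drop leading bullets, numbering, and whitespace such as "- ", "* ", "3. ".
        candidate = re.sub(r"^[\s\-\*•\d\.\)]+", "", line).strip().lower()
        if candidate in VALID_MOVES:
            moves.append(candidate)
    return moves


def next_moves(llm: Callable[[str], str], own_hp: int, opp_hp: int, distance: int) -> List[str]:
    """One decision step: describe the state, query the model, parse its answer."""
    return parse_moves(llm(describe_state(own_hp, opp_hp, distance)))
```

Filtering the reply against a fixed vocabulary means chatty or malformed output simply yields fewer moves rather than an error, which matters when the game keeps running while the model responds.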

Quick Start & Requirements

  • Install dependencies: make install or pip install -r requirements.txt.
  • Download Street Fighter III ROMs and place them in ~/.diambra/roms.
  • Configure API keys or local models (e.g., via Ollama) in a .env file.
  • Run with make run or via Docker.
  • Documentation for DIAMBRA, the underlying game environment: https://docs.diambra.ai/

Highlighted Details

  • Real-time LLM vs. LLM combat in Street Fighter III.
  • Supports both text-based and vision-based LLM inputs.
  • Elo leaderboard ranking LLMs across 546+ fights.
  • Customizable prompts for agent behavior.
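For readers unfamiliar with Elo ranking, the standard update rule can be sketched as below. The K-factor of 32 and the 1500 starting rating used in the example are assumptions for illustration; the summary does not state which parameters the leaderboard actually uses.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float,
    rating_b: float,
    score_a: float,
    k: float = 32.0,  # assumed K-factor, not taken from the project
) -> Tuple[float, float]:
    """Return new (A, B) ratings; score_a is 1.0 for a win, 0.5 draw, 0.0 loss."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two models starting at an assumed 1500 rating; model A wins the fight.
print(update_elo(1500.0, 1500.0, 1.0))  # (1516.0, 1484.0)
```

Because each fight transfers points from loser to winner, the total rating across models stays constant, and an upset win against a higher-rated model moves both ratings more than an expected result does.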

Maintenance & Community

  • Developed by the OpenGenerativeAI team, with contributions from phospho and Quivr.
  • Credits include @oulianov, @Pierre-LouisBJT, @Platinn, and @StanGirard.
  • Open to Pull Requests for new models.

Licensing & Compatibility

  • The repository itself appears to be under a permissive license, but the underlying game engine (Diambra) may have different licensing terms. ROMs are typically subject to copyright.

Limitations & Caveats

  • Requires specific game ROMs, which are not provided and may have legal restrictions.
  • Rankings depend heavily on API latency and LLM response time; Claude 3 Sonnet's low score, for example, is attributed to refusals and latency.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 4 stars in the last 30 days
