llm-colosseum  by OpenGenerativeAI

LLM benchmark using Street Fighter III to evaluate real-time decision-making

created 1 year ago
1,444 stars

Top 28.9% on sourcepulse

Project Summary

This project provides a novel benchmark for evaluating Large Language Models (LLMs) by pitting them against each other in real-time gameplay of Street Fighter III. It targets AI researchers and developers seeking to assess LLM capabilities in speed, strategic thinking, adaptability, and resilience within a dynamic, interactive environment.

How It Works

LLMs act as the players and are driven through API calls. The system sends the model either a text description of the game state (TextRobot) or a screenshot (VisionRobot), and the model replies with a list of moves to execute. The benchmark therefore exercises the model's contextual understanding and decision-making directly, unlike traditional reinforcement learning (RL) agents, which rely solely on reward functions.
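The text-based loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the move vocabulary, the prompt wording, and the function names (`describe_state`, `build_prompt`, `parse_moves`, `next_moves`) are all hypothetical, and the LLM is passed in as a plain callable so that any API client or local model could be plugged in.

```python
import re
from typing import Callable, List

# Hypothetical move vocabulary; the project's real move list may differ.
VALID_MOVES = {"move closer", "move away", "jump", "crouch",
               "low punch", "high kick", "fireball"}


def describe_state(own_hp: int, opp_hp: int, distance: int) -> str:
    """Render an assumed, simplified game state as text for the prompt."""
    return (f"Your health: {own_hp}. Opponent health: {opp_hp}. "
            f"Distance to opponent: {distance}.")


def build_prompt(state_text: str) -> str:
    """Ask for moves as a bullet list so the reply is easy to parse."""
    options = ", ".join(sorted(VALID_MOVES))
    return (f"You are playing a fighting game.\n{state_text}\n"
            f"Choose your next moves from: {options}.\n"
            "Answer with one move per line, each starting with '- '.")


def parse_moves(reply: str) -> List[str]:
    """Keep only bullet lines that name a known move; ignore everything else."""
    moves = []
    for line in reply.splitlines():
        match = re.match(r"^\s*[-*]\s*(.+?)\s*$", line)
        if match and match.group(1).lower() in VALID_MOVES:
            moves.append(match.group(1).lower())
    return moves


def next_moves(state_text: str, llm: Callable[[str], str]) -> List[str]:
    """One decision step: state text in, validated move list out."""
    return parse_moves(llm(build_prompt(state_text)))
```

With this shape, a refusal or an off-topic reply simply yields an empty move list, so the character does nothing while the opponent keeps acting.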

Quick Start & Requirements

  • Install dependencies: make install or pip install -r requirements.txt.
  • Download Street Fighter III ROMs and place them in ~/.diambra/roms.
  • Configure API keys or local models (e.g., via Ollama) in a .env file.
  • Run with make run or via Docker.
  • Documentation for Diambra, the underlying game engine: https://docs.diambra.ai/
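Before running, it can help to confirm that the ROM directory and the .env file are in place. The sketch below is a hypothetical preflight check and is not part of the project. It assumes the .env file uses simple KEY=VALUE lines, and the key name in the usage note is only an example.

```python
import os
from pathlib import Path
from typing import Dict, List


def load_env_file(path: Path) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and comments."""
    values: Dict[str, str] = {}
    if not path.is_file():
        return values
    for raw in path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def preflight(rom_dir: Path, env_path: Path, required_keys: List[str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    # The ROM directory must exist and contain at least one file.
    if not rom_dir.is_dir() or not any(rom_dir.iterdir()):
        problems.append(f"no ROM files found in {rom_dir}")
    # A key counts as set if it is in the .env file or the process environment.
    env = load_env_file(env_path)
    for key in required_keys:
        if not env.get(key) and not os.environ.get(key):
            problems.append(f"missing {key}")
    return problems
```

A call such as `preflight(Path.home() / ".diambra" / "roms", Path(".env"), ["OPENAI_API_KEY"])` would cover the ROM path named above. The key name is an assumption and depends on which provider or local model is configured.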

Highlighted Details

  • Real-time LLM vs. LLM combat in Street Fighter III.
  • Supports both text-based and vision-based LLM inputs.
  • Elo leaderboard ranking LLMs from the results of 546+ fights.
  • Customizable prompts for agent behavior.
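For readers unfamiliar with Elo ratings, the standard update rule can be sketched as below. The K-factor of 32 and the 1500 starting rating in the usage note are conventional assumptions; the page does not state which parameters or exact procedure the leaderboard uses.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    """Update both ratings after one fight.

    score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a draw.
    k is an assumed K-factor controlling how fast ratings move.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b
```

Two models that both start at 1500 move to 1516 and 1484 after a single decisive fight. An upset win against a higher-rated model moves both ratings further than an expected win does.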

Maintenance & Community

  • Developed by the OpenGenerativeAI team, with contributions from phospho and Quivr.
  • Credits include @oulianov, @Pierre-LouisBJT, @Platinn, and @StanGirard.
  • Open to Pull Requests for new models.

Licensing & Compatibility

  • The repository itself appears to be under a permissive license, but the underlying game engine (Diambra) may have different licensing terms. ROMs are typically subject to copyright.

Limitations & Caveats

  • Requires specific game ROMs, which are not provided and may have legal restrictions.
  • Rankings depend heavily on API latency and LLM response times. Claude 3 Sonnet's low score, for example, is attributed to refusals and latency.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 25 stars in the last 90 days
