llm-colosseum by OpenGenerativeAI

LLM benchmark using Street Fighter III to evaluate real-time decision-making

Created 1 year ago

1,461 stars

Top 27.8% on SourcePulse

2 Experts Love This Project

chiphuyen: Author of "AI Engineering" and "Designing Machine Learning Systems"
JustinLin610: Core Maintainer at Alibaba Qwen

Project Summary

This project provides a novel benchmark for evaluating Large Language Models (LLMs) by pitting them against each other in real-time gameplay of Street Fighter III. It targets AI researchers and developers seeking to assess LLM capabilities in speed, strategic thinking, adaptability, and resilience within a dynamic, interactive environment.

How It Works

LLMs act as AI players, controlled via API calls. At each step the system sends the LLM either a text description of the game state (TextRobot) or a screenshot (VisionRobot), and the LLM replies with a list of moves to execute. This lets the models draw on contextual understanding and reasoning to choose actions, unlike traditional reinforcement learning (RL) agents, which learn a policy from reward signals alone.
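The text-based loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the GameState fields, the VALID_MOVES vocabulary, the prompt wording, and the llm callable are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical move vocabulary; the project's real move list may differ.
VALID_MOVES = {"move closer", "move away", "jump", "low punch", "high kick", "fireball"}


@dataclass
class GameState:
    """Assumed minimal snapshot of a fight; the real state is richer."""
    own_health: int
    opponent_health: int
    distance: int  # horizontal gap between the fighters (assumed unit)


def describe(state: GameState) -> str:
    """Turn the game state into the kind of text a TextRobot-style agent might send."""
    return (
        f"Your health: {state.own_health}. "
        f"Opponent health: {state.opponent_health}. "
        f"Distance to opponent: {state.distance}."
    )


def build_prompt(state: GameState) -> str:
    """Combine the state description with the allowed moves (illustrative wording)."""
    moves = ", ".join(sorted(VALID_MOVES))
    return (
        "You are playing a fighting game.\n"
        f"{describe(state)}\n"
        f"Reply with one move per line, chosen from: {moves}."
    )


def parse_moves(reply: str) -> List[str]:
    """Keep only lines that name a known move; anything else (e.g. a refusal) is dropped."""
    moves = []
    for line in reply.splitlines():
        # Strip list markers such as "- " or "1. " before matching.
        candidate = line.strip().lstrip("-*0123456789. ").lower()
        if candidate in VALID_MOVES:
            moves.append(candidate)
    return moves


def next_moves(state: GameState, llm: Callable[[str], str]) -> List[str]:
    """One decision step: describe the state, query the model, parse its move list."""
    return parse_moves(llm(build_prompt(state)))
```

Passing the model as a plain callable keeps the sketch independent of any particular API client; a stub such as lambda prompt: "- Fireball" is enough to exercise it offline.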

Quick Start & Requirements

Install dependencies: make install or pip install -r requirements.txt.
Download Street Fighter III ROMs and place them in ~/.diambra/roms.
Configure API keys or local models (e.g., via Ollama) in a .env file.
Run with make run or via Docker.
Documentation for DIAMBRA, the underlying game environment: https://docs.diambra.ai/
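Before running, it can help to confirm that the two manual steps above (ROMs in place, .env configured) are done. The following is a rough pre-flight sketch under stated assumptions: the helper names parse_env and setup_problems are hypothetical, the script does not know which ROM file names or .env keys the project expects, and it only checks that something is present.

```python
from pathlib import Path
from typing import Dict, List


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def setup_problems(rom_dir: Path, env_file: Path) -> List[str]:
    """Return human-readable problems; an empty list means the basics are in place."""
    problems = []
    if not rom_dir.is_dir() or not any(rom_dir.iterdir()):
        problems.append(f"no ROM files found in {rom_dir}")
    if not env_file.is_file():
        problems.append(f"missing {env_file}")
    elif not parse_env(env_file.read_text()):
        problems.append(f"{env_file} defines no settings")
    return problems


# Example call, using the ROM location named in the steps above:
# setup_problems(Path.home() / ".diambra" / "roms", Path(".env"))
```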

Highlighted Details

Real-time LLM vs. LLM combat in Street Fighter III.
Supports both text-based and vision-based LLM inputs.
Elo leaderboard ranking LLMs across 546+ fights.
Customizable prompts for agent behavior.
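For readers unfamiliar with Elo rankings, the standard two-player update can be sketched as below. This is the textbook formula, not necessarily the exact variant the leaderboard uses; the K-factor of 32 and the starting rating of 1500 in the usage note are assumptions.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Return both new ratings after one fight.

    score_a is 1.0 if A won, 0.0 if A lost, and 0.5 for a draw.
    k is an assumed K-factor controlling how far one result moves a rating.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b
```

With two models both starting at 1500, a win moves the winner to 1516 and the loser to 1484; an upset against a higher-rated model moves both ratings further than an expected result does.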

Maintenance & Community

Developed by the OpenGenerativeAI team, with contributions from phospho and Quivr.
Credits include @oulianov, @Pierre-LouisBJT, @Platinn, and @StanGirard.
Open to Pull Requests for new models.

Licensing & Compatibility

The repository itself appears to be under a permissive license, but the underlying game engine (Diambra) may have different licensing terms. ROMs are typically subject to copyright.

Limitations & Caveats

Requires specific game ROMs, which are not provided and may have legal restrictions.
Rankings depend heavily on API latency and LLM response time as well as decision quality. For example, Claude 3 Sonnet's low score is attributed to refusals and latency.

Health Check

Last commit: 9 months ago
Responsiveness: Inactive
Pull requests (30d): 0
Issues (30d): 0

Star History

8 stars in the last 30 days
