elimination_game by lechmazur

LLM benchmark for social reasoning, strategy, and deception

Created 10 months ago

297 stars

Top 89.4% on SourcePulse

1 Expert Loves This Project

Jiayi Pan

Author of SWE-Gym; MTS at xAI

Project Summary

This repository introduces the Elimination Game, a multi-player benchmark designed to evaluate Large Language Models (LLMs) on social reasoning, strategy, and deception. It pits 8 LLMs against each other in rounds of public and private communication, alliance formation, and voting to eliminate peers, culminating in a jury vote for the winner. The benchmark aims to uncover how LLMs navigate complex social dynamics and strategic decision-making.

How It Works

The game simulates a tournament in which LLMs act as players. Each round consists of a public subround for open communication, preference rankings for private pairings, private subrounds for alliance discussions, and anonymous voting for elimination. Tie-breaking mechanisms resolve tied votes. The benchmark evaluates LLM capabilities beyond simple task completion: emergent behavior in social contexts, strategic foresight, and the ability to manage hidden intentions and alliances.
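
The elimination vote described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the function name `tally_elimination`, the vote format, and the tie-breaking rule (a seeded random pick among the tied players) are all assumptions, since the README summary does not say how ties are actually resolved.

```python
import random
from collections import Counter


def tally_elimination(votes, rng=None):
    """Tally one anonymous elimination vote.

    votes: dict mapping each voter's name to the player they vote out.
    Returns (eliminated_player, tied_players).

    Hypothetical helper: the real benchmark's tie-breaking procedure is
    not documented in the summary, so ties here are broken by a seeded
    random choice purely as a placeholder.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    rng = rng or random.Random(0)

    # Count how many votes each target received.
    counts = Counter(votes.values())
    top = max(counts.values())

    # Everyone sharing the highest count is a candidate for elimination.
    tied = sorted(player for player, n in counts.items() if n == top)

    # Assumption: a random pick among tied players stands in for the
    # benchmark's own tie-breaking mechanism.
    eliminated = tied[0] if len(tied) == 1 else rng.choice(tied)
    return eliminated, tied


# Clear majority: B receives two of three votes.
clear_out, clear_tied = tally_elimination({"A": "B", "B": "C", "C": "B"})

# Tie: A and C receive two votes each, so the placeholder rule applies.
tie_out, tie_tied = tally_elimination(
    {"A": "C", "B": "C", "C": "A", "D": "A", "E": "B"},
    rng=random.Random(42),
)
```

Passing an explicit `random.Random` instance keeps the placeholder tie-break reproducible across runs, which matters when replaying games for analysis.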

Quick Start & Requirements

The README does not provide specific installation or execution commands, nor does it detail explicit software or hardware requirements beyond the implicit need to run LLMs.

Highlighted Details

Comprehensive Metrics: Tracks TrueSkill ratings, rank distributions, "buddy betrayal" rates (both betrayer and victim perspectives), first-place counts, earliest elimination rates, final 2 win rates, and model wordiness.
Detailed Visualizations: Offers round-by-round replays of public chat, private chats, voting, and jury decisions.
Extensive LLM Testing: Includes benchmark results for numerous leading LLMs, such as Gemini, Grok, Claude, GPT-4 variants, Llama, Mistral, and others, providing a rich dataset for comparative analysis.
Emergent Behavior Analysis: Showcases example dialogues demonstrating complex strategies like coded communication, alliance manipulation, and deceptive tactics employed by the LLMs.
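
Several of the placement-based metrics listed above (first-place counts, rank distributions, earliest elimination rates) can be derived from final placements alone. The sketch below assumes a hypothetical input format, one dict per game mapping model name to final rank with 1 as the winner, and treats the worst rank in a game as "eliminated first"; neither the format nor the function name `summarize_placements` comes from the repository. TrueSkill ratings and betrayal rates need more information than final ranks and are not covered here.

```python
from collections import Counter, defaultdict


def summarize_placements(games):
    """Aggregate per-model placement statistics.

    games: list of dicts, each mapping model name -> final rank (1 = winner).
    This input format is an assumption made for illustration.
    """
    rank_counts = defaultdict(Counter)  # model -> Counter of ranks
    first_out = Counter()               # model -> times eliminated first

    for placements in games:
        # Assumption: the numerically largest rank is the first elimination.
        worst = max(placements.values())
        for model, rank in placements.items():
            rank_counts[model][rank] += 1
            if rank == worst:
                first_out[model] += 1

    summary = {}
    for model, ranks in rank_counts.items():
        played = sum(ranks.values())
        summary[model] = {
            "games": played,
            "wins": ranks[1],
            "first_out_rate": first_out[model] / played,
            "rank_distribution": dict(ranks),
        }
    return summary


# Two toy three-player games with made-up model names.
example = summarize_placements([
    {"m1": 1, "m2": 2, "m3": 3},
    {"m1": 3, "m2": 1, "m3": 2},
])
```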

Maintenance & Community

The project is maintained by lechmazur, who can be followed on GitHub for updates. No specific community channels (Discord, Slack) or roadmap details are provided in the README.

Licensing & Compatibility

The README does not specify a license.

Limitations & Caveats

The README lacks explicit installation instructions, dependency lists, and licensing information, all of which are critical for adoption and reproducibility. Because the benchmark runs multi-round games among many LLMs, it depends on LLM API access or local hosting, which implies significant computational resources and setup effort.

Explore Similar Projects

p2l by lmarena

verdict by haizelabs

awesome-LLM-game-agent-papers by git-disl

JudgeLM by baaivision

LLMRank by RUCAIBox

ChatEval by thunlp

llm-leaderboard by LudwigStumpp

AI_Diplomacy by GoodStartLabs

liars-bar-llm by LYiHub

arena-hard-auto by lmarena

llm-colosseum by OpenGenerativeAI

FastChat by lm-sys