elimination_game  by lechmazur

LLM benchmark for social reasoning, strategy, and deception

created 5 months ago
284 stars

Top 93.1% on sourcepulse

Project Summary

This repository introduces the Elimination Game, a multi-player benchmark that evaluates Large Language Models (LLMs) on social reasoning, strategy, and deception. It pits eight LLMs against each other in rounds of public and private communication, alliance formation, and voting to eliminate peers, culminating in a jury vote for the winner. The benchmark aims to show how LLMs handle complex social dynamics and strategic decision-making.

How It Works

The game simulates a tournament in which LLMs act as players. Each round consists of a public subround for open communication, preference rankings that determine private pairings, private subrounds for alliance discussions, and an anonymous vote to eliminate one player. Tie-breaking mechanisms resolve tied votes. Rather than scoring simple task completion, the benchmark evaluates emergent behavior in social contexts, strategic foresight, and the ability to manage hidden intentions and alliances.
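The round structure described above can be sketched in a few lines of Python. This is only an illustration, not the repository's code: the function names `pair_by_preference` and `tally_elimination`, the greedy pairing rule, and the random tie-break are all assumptions, since the summary does not say how pairings or ties are actually resolved.

```python
import random
from collections import Counter


def pair_by_preference(rankings):
    """Pair players for private subrounds from their preference rankings.

    Hypothetical greedy rule (an assumption, not the benchmark's method):
    walk the players in order and give each unpaired player the
    highest-ranked peer who is still unpaired. Each ranking is assumed
    to list every other player.
    """
    unpaired = list(rankings)
    pairs = []
    while len(unpaired) >= 2:
        player = unpaired.pop(0)
        partner = next(p for p in rankings[player] if p in unpaired)
        unpaired.remove(partner)
        pairs.append((player, partner))
    return pairs


def tally_elimination(votes, rng):
    """Return the player with the most votes.

    `votes` maps each voter to the player they want eliminated.
    Ties are broken at random here, which is an assumed rule.
    """
    counts = Counter(votes.values())
    top = max(counts.values())
    tied = sorted(name for name, n in counts.items() if n == top)
    return tied[0] if len(tied) == 1 else rng.choice(tied)


rankings = {
    "A": ["B", "C", "D"],
    "B": ["A", "C", "D"],
    "C": ["D", "A", "B"],
    "D": ["C", "B", "A"],
}
pairs = pair_by_preference(rankings)
eliminated = tally_elimination(
    {"A": "C", "B": "C", "C": "A", "D": "C"}, random.Random(0)
)
```

In a full game this loop would repeat, with the eliminated player removed each round, until the final players face the jury vote.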

Quick Start & Requirements

The README does not provide specific installation or execution commands, nor does it detail explicit software or hardware requirements beyond the implicit need to run LLMs.

Highlighted Details

  • Comprehensive Metrics: Tracks TrueSkill ratings, rank distributions, "buddy betrayal" rates (both betrayer and victim perspectives), first-place counts, earliest elimination rates, final 2 win rates, and model wordiness.
  • Detailed Visualizations: Offers round-by-round replays of public chat, private chats, voting, and jury decisions.
  • Extensive LLM Testing: Includes benchmark results for numerous leading LLMs, such as Gemini, Grok, Claude, GPT-4 variants, Llama, Mistral, and others, providing a rich dataset for comparative analysis.
  • Emergent Behavior Analysis: Showcases example dialogues demonstrating complex strategies like coded communication, alliance manipulation, and deceptive tactics employed by the LLMs.
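A few of the simpler metrics listed above can be computed from per-game finishing orders. The sketch below is a hedged illustration: the `summarize` function, the input format (one list per game, winner first and first-eliminated player last), and the exact metric definitions are assumptions, not taken from the repository. TrueSkill ratings and betrayal rates are left out because they need data this summary does not describe.

```python
from collections import Counter


def summarize(games):
    """Compute first-place counts, earliest-elimination rate, and
    final-2 win rate per model.

    Assumed input: each game is a list of model names in finishing
    order, winner first and the first-eliminated model last.
    """
    played = Counter()
    firsts = Counter()
    first_out = Counter()
    finals = Counter()
    for order in games:
        for name in order:
            played[name] += 1
        firsts[order[0]] += 1
        first_out[order[-1]] += 1
        for name in order[:2]:  # the two finalists
            finals[name] += 1

    stats = {}
    for name in played:
        stats[name] = {
            "first_places": firsts[name],
            "earliest_elimination_rate": first_out[name] / played[name],
            # Share of final-2 appearances that ended in a win;
            # None if the model never reached the final two.
            "final2_win_rate": (
                firsts[name] / finals[name] if finals[name] else None
            ),
        }
    return stats


stats = summarize([
    ["a", "b", "c"],
    ["b", "a", "c"],
    ["a", "c", "b"],
])
```

The model names `a`, `b`, and `c` are placeholders, not results from the benchmark.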

Maintenance & Community

The project is maintained by lechmazur, who can be followed on GitHub for updates. No specific community channels (Discord, Slack) or roadmap details are provided in the README.

Licensing & Compatibility

The README does not specify a license.

Limitations & Caveats

The README lacks explicit installation instructions, dependency lists, and licensing information, all of which matter for adoption and reproducibility. Because the benchmark depends on LLM API access or local hosting, running it likely requires substantial computational resources and setup effort.

Health Check

Last commit: 2 weeks ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0
Star History: 25 stars in the last 90 days
