TheAgentCompany by TheAgentCompany

Agent benchmark for real-world professional tasks in a simulated software company

Created 1 year ago

643 stars

Top 51.7% on SourcePulse

4 Experts Love This Project

Simon Willison

Coauthor of Django

Vincent Weisser

Cofounder of Prime Intellect

Gregor Zunic

Cofounder of Browser Use

Robert Stojnic

Cocreator of Papers with Code

Project Summary

TheAgentCompany provides an extensible benchmark for evaluating Large Language Model (LLM) agents on real-world professional tasks. It targets researchers and developers building AI agents that interact with digital environments, offering a standardized way to measure performance in simulated software company roles.

How It Works

The benchmark simulates a digital workplace in which agents complete tasks by browsing the web, writing and executing code, and communicating. Each task environment is a Docker container that holds an initialization script, the task instructions, and a workspace. Provided evaluation scripts score the agent on its final results and on intermediate checkpoints, with support for both deterministic and LLM-based grading.
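
As a rough illustration of checkpoint-based scoring, the sketch below awards points per checkpoint and blends that partial credit with a bonus for full completion. The names Checkpoint, grade, and partial_score are hypothetical, and the 50/50 weighting is an assumption made for this example; none of it is taken from the project's evaluation scripts. Each check is a plain callable, so it could be a deterministic test or a wrapper around an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Checkpoint:
    """One gradable milestone in a task (hypothetical structure)."""
    name: str
    points: int
    # Returns True if the checkpoint was reached. May be a deterministic
    # check (e.g. a file exists) or a wrapper around an LLM-based judge.
    check: Callable[[], bool]


def grade(checkpoints: list[Checkpoint]) -> tuple[int, int]:
    """Return (earned points, total points) for a list of checkpoints."""
    earned = sum(cp.points for cp in checkpoints if cp.check())
    total = sum(cp.points for cp in checkpoints)
    return earned, total


def partial_score(earned: int, total: int) -> float:
    """Blend checkpoint credit with a completion bonus.

    Assumption for illustration: half the score comes from the fraction
    of points earned, half from completing every checkpoint.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    completed = 1.0 if earned == total else 0.0
    return 0.5 * (earned / total) + 0.5 * completed


if __name__ == "__main__":
    demo = [
        Checkpoint("cloned repository", 1, lambda: True),
        Checkpoint("ran unit tests", 1, lambda: True),
        Checkpoint("posted summary to chat", 2, lambda: False),
    ]
    earned, total = grade(demo)
    print(f"{earned}/{total} points, score {partial_score(earned, total):.2f}")
```

Under this weighting, an agent that reaches most checkpoints but misses the final deliverable scores clearly lower than one that finishes the task, which is the usual motivation for combining result-based and checkpoint-based credit.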

Quick Start & Requirements

Setup: Requires Docker and Docker Compose. The setup.sh (Linux/Mac) or setup.bat (Windows) script deploys the required services (GitLab, Plane, ownCloud, RocketChat) with pre-baked data.
Disk Space: At least 30 GB of free disk space is recommended.
Resources: Baseline experiments used Amazon EC2 t3.2xlarge instances.
Documentation: Server Setup Doc, Evaluation Doc, Leaderboard Overview.
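
Before running the setup script, it can help to confirm the prerequisites listed above. The following is a minimal preflight sketch, not part of the project: the function names are hypothetical, and it checks only for the docker executable, since Docker Compose may be installed either as a Docker plugin or as a separate binary.

```python
import shutil

REQUIRED_FREE_GB = 30          # mirrors the "30 GB" recommendation above
REQUIRED_TOOLS = ("docker",)   # Compose may be a plugin, so it is not checked here


def free_gb(path: str = ".") -> float:
    """Free disk space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1024**3


def missing_tools(tools=REQUIRED_TOOLS) -> list[str]:
    """Names of required executables that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def preflight(path: str = ".", required_gb: float = REQUIRED_FREE_GB,
              tools=REQUIRED_TOOLS) -> list[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = [f"'{tool}' not found on PATH" for tool in missing_tools(tools)]
    available = free_gb(path)
    if available < required_gb:
        problems.append(
            f"only {available:.1f} GB free at {path!r}; "
            f"{required_gb}+ GB recommended"
        )
    return problems


if __name__ == "__main__":
    issues = preflight()
    for issue in issues:
        print("WARNING:", issue)
    if not issues:
        print("Preflight checks passed.")
```

The check reports warnings rather than failing hard because the disk figure is a recommendation, not a strict requirement.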

Highlighted Details

Diverse task roles: Software Engineer, Product Manager, Data Scientist, HR, Finance, Admin.
Multiple data types: Coding, conversational, mathematical, image processing, text comprehension.
Supports multiple agent interactions.
Comprehensive scoring system with result-based and sub-checkpoint evaluation.
Extensible framework for adding new tasks and evaluators.

Maintenance & Community

Contributions are welcomed via GitHub issues. Contact information for key contributors is provided.

Licensing & Compatibility

Distributed under the MIT License, permitting commercial use and closed-source linking.

Limitations & Caveats

Mac and Windows users may require specific configurations for host networking. The setup script requires root privileges for certain operations.

Health Check

Last Commit

3 months ago

Responsiveness

1 day

Pull Requests (30d)