TheAgentCompany by TheAgentCompany

Agent benchmark for real-world professional tasks in a simulated software company

created 1 year ago
509 stars

Top 62.1% on sourcepulse

Project Summary

TheAgentCompany provides an extensible benchmark for evaluating Large Language Model (LLM) agents on real-world professional tasks. It targets researchers and developers building AI agents that interact with digital environments, offering a standardized way to measure performance in simulated software company roles.

How It Works

The benchmark simulates a digital workplace in which agents complete tasks by browsing the web, writing and executing code, and communicating. Task environments are packaged as Docker containers, each containing an initialization script, the task instructions, and a workspace. Provided evaluation scripts score an agent on both its final result and its intermediate checkpoints, and they support deterministic as well as LLM-based grading.
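The checkpoint-based grading described above can be illustrated with a minimal sketch. This is not the project's evaluator: the `Checkpoint` class, the `score_task` function, the point values, and the `llm_judge_stub` stand-in are all hypothetical names chosen for illustration, and the benchmark's real scoring formula may weight results differently.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Checkpoint:
    """One gradable milestone in a task (hypothetical structure)."""
    name: str
    points: int
    # Either a deterministic check or a wrapper around an LLM-based judge.
    grader: Callable[[dict], bool]


def score_task(checkpoints: List[Checkpoint], state: dict) -> dict:
    """Run every grader against the final environment state and total the points."""
    earned = sum(cp.points for cp in checkpoints if cp.grader(state))
    total = sum(cp.points for cp in checkpoints)
    return {"earned": earned, "total": total, "completed": earned == total}


def llm_judge_stub(state: dict) -> bool:
    # Placeholder only: a real LLM grader would send the message to a model
    # and parse its verdict instead of matching a keyword.
    return "done" in state.get("chat_message", "").lower()


checkpoints = [
    Checkpoint("repo cloned", 1, lambda s: s.get("repo_cloned", False)),
    Checkpoint("tests pass", 2, lambda s: s.get("tests_passed", False)),
    Checkpoint("status reported", 1, llm_judge_stub),
]

result = score_task(
    checkpoints,
    {"repo_cloned": True, "tests_passed": False, "chat_message": "All done!"},
)
print(result)  # {'earned': 2, 'total': 4, 'completed': False}
```

Keeping each grader as a plain callable lets deterministic checks and model-based judges share one scoring loop, so partial progress is credited even when the final result is incomplete.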

Quick Start & Requirements

  • Setup: Requires Docker and Docker Compose. A setup.sh (Linux/Mac) or setup.bat (Windows) script automates the deployment of necessary services (GitLab, Plane, ownCloud, RocketChat) with pre-baked data.
  • Disk Space: 30+ GB free disk space recommended.
  • Resources: Baseline experiments used Amazon EC2 t3.2xlarge instances.
  • Documentation: Server Setup Doc, Evaluation Doc, Leaderboard Overview.
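Before running the setup script, it can help to confirm the prerequisites listed above. The following is a rough preflight sketch, not part of the project: the `preflight` and `has_compose` functions are hypothetical helpers, and the 30 GB threshold is simply the recommendation quoted above.

```python
import shutil
import subprocess
from typing import List, Optional

MIN_FREE_GB = 30  # taken from the disk-space recommendation above


def has_compose() -> bool:
    """True if the Compose plugin or the standalone docker-compose binary is available."""
    if shutil.which("docker-compose"):
        return True
    if not shutil.which("docker"):
        return False
    try:
        proc = subprocess.run(
            ["docker", "compose", "version"], capture_output=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return proc.returncode == 0


def preflight(
    free_gb: Optional[float] = None,
    docker: Optional[bool] = None,
    compose: Optional[bool] = None,
) -> List[str]:
    """Return a list of problems; an empty list means the basic checks passed.

    Arguments left as None are detected from the current machine.
    """
    if free_gb is None:
        free_gb = shutil.disk_usage(".").free / 1e9
    if docker is None:
        docker = shutil.which("docker") is not None
    if compose is None:
        compose = has_compose()

    problems = []
    if not docker:
        problems.append("Docker not found on PATH")
    if not compose:
        problems.append("Docker Compose not found")
    if free_gb < MIN_FREE_GB:
        problems.append(f"only {free_gb:.0f} GB free; {MIN_FREE_GB}+ GB recommended")
    return problems
```

Calling `preflight()` with no arguments inspects the local machine; passing explicit values, for example `preflight(free_gb=12.0, docker=True, compose=False)`, makes the checks easy to exercise without Docker installed.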

Highlighted Details

  • Diverse task roles: Software Engineer, Product Manager, Data Scientist, HR, Finance, Admin.
  • Multiple data types: Coding, conversational, mathematical, image processing, text comprehension.
  • Supports multiple agent interactions.
  • Comprehensive scoring system with result-based and sub-checkpoint evaluation.
  • Extensible framework for adding new tasks and evaluators.

Maintenance & Community

Contributions are welcomed via GitHub issues. Contact information for key contributors is provided.

Licensing & Compatibility

Distributed under the MIT License, permitting commercial use and closed-source linking.

Limitations & Caveats

Mac and Windows users may need additional configuration for host networking. The setup script requires root privileges for certain operations.

Health Check

  • Last commit: 3 days ago
  • Responsiveness: 1 day
  • Pull requests (30d): 3
  • Issues (30d): 6
  • Star history: 185 stars in the last 90 days
