TheAgentCompany by TheAgentCompany

Agent benchmark for real-world professional tasks in a simulated software company

Created 1 year ago
548 stars

Top 58.3% on SourcePulse

Project Summary

TheAgentCompany provides an extensible benchmark for evaluating Large Language Model (LLM) agents on real-world professional tasks. It targets researchers and developers building AI agents that interact with digital environments, offering a standardized way to measure performance in simulated software company roles.

How It Works

The benchmark simulates a digital workplace in which agents complete tasks by browsing the web, writing and executing code, and communicating. Each task environment runs in a Docker container that holds an initialization script, the task instructions, and a workspace. Provided evaluation scripts score an agent on both its final result and its intermediate checkpoints, and support deterministic as well as LLM-based grading.
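To make checkpoint-based evaluation concrete, here is a minimal sketch of how a task score might combine partial checkpoint credit with a bonus for full completion. The `Checkpoint` class, the `score_task` function, and the 50/50 weighting are illustrative assumptions, not the benchmark's actual evaluator API or scoring formula.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Checkpoint:
    """One graded milestone within a task (hypothetical structure)."""
    description: str
    points: int
    # A deterministic check, or a wrapper around an LLM-based grader.
    check: Callable[[], bool]


def score_task(checkpoints: List[Checkpoint],
               completion_weight: float = 0.5) -> Dict[str, float]:
    """Blend partial checkpoint credit with an all-or-nothing completion term.

    completion_weight is an assumed parameter: the share of the score that is
    awarded only when every checkpoint passes.
    """
    total = sum(cp.points for cp in checkpoints)
    earned = sum(cp.points for cp in checkpoints if cp.check())
    partial = earned / total if total else 0.0
    completed = 1.0 if total and earned == total else 0.0
    score = (1 - completion_weight) * partial + completion_weight * completed
    return {"earned": earned, "total": total, "score": score}


if __name__ == "__main__":
    demo = [
        Checkpoint("repository cloned", 1, lambda: True),
        Checkpoint("unit tests pass", 2, lambda: False),
        Checkpoint("teammate notified", 1, lambda: True),
    ]
    print(score_task(demo))
```

Weighting full completion separately rewards agents that finish a task over agents that only collect early checkpoints; the exact split used here is a placeholder.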

Quick Start & Requirements

  • Setup: Requires Docker and Docker Compose. The setup.sh (Linux/Mac) or setup.bat (Windows) script automates deployment of the required services (GitLab, Plane, ownCloud, RocketChat) with preloaded data.
  • Disk Space: 30+ GB free disk space recommended.
  • Resources: Baseline experiments used Amazon EC2 t3.2xlarge instances.
  • Documentation: Server Setup Doc, Evaluation Doc, Leaderboard Overview.
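Before running the setup script, it can help to confirm the prerequisites listed above. The following is a rough preflight sketch, assuming the 30 GB figure as the threshold; `has_compose` and `preflight` are hypothetical helper names and are not part of the project's tooling.

```python
import shutil
import subprocess
from typing import Callable, List


def has_compose(which: Callable = shutil.which,
                run: Callable = subprocess.run) -> bool:
    """Return True if either the standalone docker-compose binary or the
    `docker compose` plugin appears to be available."""
    if which("docker-compose"):
        return True
    if which("docker") is None:
        return False
    try:
        result = run(["docker", "compose", "version"],
                     capture_output=True, timeout=10)
    except (OSError, subprocess.SubprocessError):
        return False
    return result.returncode == 0


def preflight(path: str = ".",
              min_free_gb: float = 30.0,
              which: Callable = shutil.which,
              disk_usage: Callable = shutil.disk_usage) -> List[str]:
    """Collect human-readable problems; an empty list means all checks passed."""
    problems = []
    if which("docker") is None:
        problems.append("docker not found on PATH")
    if not has_compose(which=which):
        problems.append("Docker Compose not found")
    free_gb = disk_usage(path).free / 1024 ** 3
    if free_gb < min_free_gb:
        problems.append(
            f"only {free_gb:.1f} GB free at {path!r}; {min_free_gb:.0f}+ GB recommended"
        )
    return problems


if __name__ == "__main__":
    issues = preflight()
    print("OK" if not issues else "\n".join(issues))
```

The `which` and `disk_usage` parameters exist only so the checks can be exercised without Docker installed; in normal use the defaults suffice.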

Highlighted Details

  • Diverse task roles: Software Engineer, Product Manager, Data Scientist, HR, Finance, Admin.
  • Multiple data types: Coding, conversational, mathematical, image processing, text comprehension.
  • Supports multiple agent interactions.
  • Comprehensive scoring system with result-based and sub-checkpoint evaluation.
  • Extensible framework for adding new tasks and evaluators.

Maintenance & Community

Contributions are welcomed via GitHub issues. Contact information for key contributors is provided.

Licensing & Compatibility

Distributed under the MIT License, permitting commercial use and closed-source linking.

Limitations & Caveats

Mac and Windows users may need additional configuration for host networking. The setup script requires root privileges for certain operations.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull requests (30d): 2
  • Issues (30d): 5
  • Star history: 21 stars in the last 30 days
