Agent benchmark for real-world professional tasks in a simulated software company
TheAgentCompany provides an extensible benchmark for evaluating Large Language Model (LLM) agents on real-world professional tasks. It targets researchers and developers building AI agents that interact with digital environments, offering a standardized way to measure performance in simulated software company roles.
How It Works
The benchmark simulates a digital workplace in which agents complete tasks by browsing the web, writing and executing code, and communicating with simulated coworkers. Each task runs in its own Docker container that bundles an initialization script, the task instructions, and a workspace. Provided evaluation scripts score the agent on final results and on intermediate checkpoints, supporting both deterministic and LLM-based grading.
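To illustrate the checkpoint idea, the sketch below shows one way a checkpoint-based grader could be structured. It is not the benchmark's actual evaluator API: Checkpoint and grade_task are hypothetical names, and a real evaluator would also cover partial credit and LLM-judged checks.

```python
# Illustrative sketch only: the names here are hypothetical,
# not TheAgentCompany's actual evaluator API.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Checkpoint:
    description: str
    points: int
    # A predicate inspecting the task outcome; deterministic checks
    # (e.g., "file exists", "merge request merged") return True/False.
    passed: Callable[[], bool]


def grade_task(checkpoints: List[Checkpoint]) -> float:
    """Return the fraction of available points the agent earned."""
    earned = sum(cp.points for cp in checkpoints if cp.passed())
    total = sum(cp.points for cp in checkpoints)
    return earned / total if total else 0.0


if __name__ == "__main__":
    checkpoints = [
        Checkpoint("cloned the repository", 1, lambda: True),
        Checkpoint("opened a merge request", 2, lambda: False),
    ]
    print(f"score: {grade_task(checkpoints):.2f}")  # score: 0.33
```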
Quick Start & Requirements
The setup.sh (Linux/Mac) or setup.bat (Windows) script automates deployment of the required services (GitLab, Plane, ownCloud, RocketChat) with pre-baked data.
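After the script finishes, it can help to confirm the services respond before running any tasks. The snippet below is a rough smoke test with placeholder ports; the actual hostnames and ports are assumptions here and should be taken from your setup script's output.

```python
# Rough post-setup smoke test. The URLs below are placeholders; substitute
# the hosts/ports reported by the setup script for your deployment.
import urllib.error
import urllib.request

SERVICES = {
    "GitLab": "http://localhost:8929",      # placeholder port
    "Plane": "http://localhost:8091",       # placeholder port
    "ownCloud": "http://localhost:8092",    # placeholder port
    "RocketChat": "http://localhost:3000",  # placeholder port
}

for name, url in SERVICES.items():
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            print(f"{name}: HTTP {resp.status}")
    except (urllib.error.URLError, OSError) as exc:
        print(f"{name}: unreachable ({exc})")
```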
Maintenance & Community
Contributions are welcome via GitHub issues; contact information for key contributors is provided.
Licensing & Compatibility
Distributed under the MIT License, permitting commercial use and closed-source linking.
Limitations & Caveats
Mac and Windows users may need additional configuration for Docker host networking, and the setup script requires root privileges for certain operations.