appworld by StonyBrookNLP

Controllable world for benchmarking interactive coding agents

Created 1 year ago

388 stars

Top 74.2% on SourcePulse

1 Expert Loves This Project

Elvis Saravia

Founder of DAIR.AI

Project Summary

AppWorld provides a high-fidelity, controllable simulated environment for benchmarking interactive coding agents. It comprises 9 day-to-day applications exposing 457 APIs and is populated with the simulated activity of ~100 users, so agents can be evaluated on complex, interactive coding tasks. The result is a standardized benchmark for agent capabilities in realistic scenarios.

How It Works

The system simulates a world of apps and people, allowing agents to interact via Python code making API calls. It supports stateful execution, maintaining context across interactions, and provides a comprehensive set of 457 APIs across 9 applications. This design facilitates the development and rigorous evaluation of agents for complex, multi-step tasks.
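To make the idea of stateful execution concrete, here is a minimal sketch of an executor whose state persists across calls. It is not AppWorld's implementation: ToyWorld and its execute method are hypothetical stand-ins, and the only assumption taken from the description above is that code run in one interaction can see what earlier interactions defined.

```python
import contextlib
import io


class ToyWorld:
    """Hypothetical stand-in for a stateful code-execution environment."""

    def __init__(self):
        # One namespace shared by every execute() call; this is what makes
        # variables defined in one call visible in the next.
        self._namespace = {}

    def execute(self, code: str) -> str:
        """Run code in the shared namespace and return whatever it printed."""
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, self._namespace)
        except Exception as error:
            # Report failures as text so the caller (an agent) can react.
            return f"{type(error).__name__}: {error}"
        return buffer.getvalue()


world = ToyWorld()
world.execute("playlist = ['song_a', 'song_b']")        # defines state
second_output = world.execute("print(len(playlist))")   # reuses that state
error_output = world.execute("print(undefined_name)")   # errors come back as text
```

The second call succeeds only because the namespace survives between calls, which is the property an agent relies on when it solves a task over several steps.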

Quick Start & Requirements

Install with pip install appworld, then run appworld install to unpack the encrypted code and appworld download data to fetch the benchmark datasets. Python 3.11+ is required. Key resources include a website, task and API explorers, a leaderboard, and extensive documentation.
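The setup steps above can be sketched as a small script. The three commands and the Python 3.11 requirement come from the text; the helper names (python_is_supported, run_setup) are hypothetical, and invoking pip through the current interpreter is an assumed equivalent of a bare pip install.

```python
import subprocess
import sys

MIN_PYTHON = (3, 11)  # prerequisite stated in the text

# The three setup steps, in the order the text gives them.
SETUP_COMMANDS = [
    [sys.executable, "-m", "pip", "install", "appworld"],
    ["appworld", "install"],           # unpack the encrypted code
    ["appworld", "download", "data"],  # fetch the benchmark datasets
]


def python_is_supported(version=None) -> bool:
    """Return True if the given (or running) Python is 3.11 or newer."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= MIN_PYTHON


def run_setup() -> None:
    """Run each setup command, stopping at the first failure."""
    if not python_is_supported():
        raise RuntimeError("AppWorld requires Python 3.11 or newer")
    for command in SETUP_COMMANDS:
        subprocess.run(command, check=True)
```

Nothing is installed until run_setup() is called, so the version check can be used on its own.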

Highlighted Details

Awarded ACL'24 Best Resource Paper.
Features 9 apps with 457 APIs, simulating ~100 users.
Supports interactive coding via world.execute() with state persistence.
Integrates Model Context Protocol (MCP) for standardized tool access.
Includes code execution safety features (syntax checking, function patching).
Offers optional Docker deployment for API serving and isolation.
Provides detailed evaluation metrics (TGC, SGC) and per-task reports.
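The two metrics named above can be illustrated with a short sketch. It assumes TGC (Task Goal Completion) is the percentage of tasks whose evaluation tests all pass, and SGC (Scenario Goal Completion) is the percentage of scenarios in which every task passes. The function name, the input format, and the convention that a task ID is its scenario ID plus a numeric suffix are assumptions for illustration, not the project's evaluation code.

```python
from collections import defaultdict


def completion_scores(task_passed: dict[str, bool]) -> tuple[float, float]:
    """Return (TGC, SGC) as percentages from per-task pass/fail results."""
    if not task_passed:
        raise ValueError("no task results given")

    # Group tasks by scenario; assumes IDs look like "<scenario>_<n>".
    scenarios = defaultdict(list)
    for task_id, passed in task_passed.items():
        scenario_id = task_id.rsplit("_", 1)[0]
        scenarios[scenario_id].append(passed)

    # TGC counts individual tasks; SGC counts scenarios where all tasks pass.
    tgc = 100 * sum(task_passed.values()) / len(task_passed)
    sgc = 100 * sum(all(group) for group in scenarios.values()) / len(scenarios)
    return tgc, sgc


# Made-up results: scenario s1 passes fully, s2 fails one of three tasks.
results = {
    "s1_1": True, "s1_2": True, "s1_3": True,
    "s2_1": True, "s2_2": False, "s2_3": True,
}
tgc, sgc = completion_scores(results)
```

Under these assumptions SGC is the stricter number: a single failed task leaves TGC high but removes its whole scenario from the SGC count.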

Maintenance & Community

Hosted on GitHub, the project appears actively maintained with clear channels for feedback and contributions via issues. Specific community links (e.g., Discord, Slack) are not explicitly mentioned.

Licensing & Compatibility

Public components are licensed under Apache 2.0. Protected portions (app and task specifics) are also Apache 2.0, with the added condition that any publicly redistributed derivatives remain encrypted. LLM training is permitted.

Limitations & Caveats

Key app/task data is in encrypted .bundle files, limiting direct GitHub inspection. Test sets provide only evaluation programs to prevent leakage. The README cautions against posting extracted .bundle content online. While safety features are robust, Docker is recommended for maximum isolation. Realistic state reversion is not supported.

Health Check

Last Commit

3 weeks ago

Responsiveness

1 day

Star History

16 stars in the last 30 days