appworld by StonyBrookNLP

Controllable world for benchmarking interactive coding agents

Created 1 year ago
270 stars

Top 95.2% on SourcePulse

Project Summary

AppWorld provides a high-fidelity, controllable simulated environment for benchmarking interactive coding agents. It features 9 day-to-day applications exposing 457 APIs and is populated with about 100 simulated users, so agents can be evaluated on complex, interactive coding tasks. The result is a standardized benchmark for measuring agent capabilities in realistic scenarios.

How It Works

The system simulates a world of apps and people, and agents interact with it by writing Python code that makes API calls. Execution is stateful: context is maintained across interactions, so later code can build on the results of earlier calls. The 457 APIs are spread across the 9 applications. This design supports both the development and the rigorous evaluation of agents on complex, multi-step tasks.
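The stateful execution described above can be illustrated with a toy session object. This is a minimal sketch of the general idea only, not AppWorld's implementation: the class name StatefulSession and its execute method are hypothetical, and the sketch provides no sandboxing.

```python
import contextlib
import io
import traceback


class StatefulSession:
    """Toy stand-in for a stateful code-execution environment (hypothetical)."""

    def __init__(self):
        # One namespace shared by every call, so variables persist between calls.
        self.namespace = {}

    def execute(self, code: str) -> str:
        """Run `code` in the shared namespace and return what it printed."""
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, self.namespace)
        except Exception:
            # Return the final traceback line instead of raising,
            # so the caller can read the error and try again.
            buffer.write(traceback.format_exc().splitlines()[-1] + "\n")
        return buffer.getvalue()


session = StatefulSession()
session.execute("playlist = ['song_a', 'song_b']")  # first interaction defines state
print(session.execute("print(len(playlist))"))      # second interaction reuses it
```

Because both calls share one namespace, the second snippet can read the variable defined by the first, which is the property an agent relies on when it solves a task over several turns.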

Quick Start & Requirements

Install the package with pip install appworld, then run appworld install to unpack the encrypted code and appworld download data to fetch the benchmark datasets. Python 3.11+ is required. Key resources include a website, task/API explorers, a leaderboard, and extensive documentation.
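The setup steps above can be collected into a small pre-flight check. This is a sketch: the helper python_is_supported and the constants MIN_PYTHON and SETUP_COMMANDS are hypothetical names, while the three commands and the version requirement are taken from the text above. The script only prints the commands and does not run them.

```python
import sys

# Minimum interpreter version stated in the summary.
MIN_PYTHON = (3, 11)

# Setup commands quoted from the summary, in the order given.
SETUP_COMMANDS = [
    "pip install appworld",
    "appworld install",
    "appworld download data",
]


def python_is_supported(version=None):
    """Return True if `version` (default: the running interpreter) meets MIN_PYTHON."""
    major_minor = tuple((version or sys.version_info)[:2])
    return major_minor >= MIN_PYTHON


if python_is_supported():
    print("Run these commands in order:")
    for command in SETUP_COMMANDS:
        print("  " + command)
else:
    print("Python 3.11 or newer is required.")
```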

Highlighted Details

  • Awarded ACL'24 Best Resource Paper.
  • Features 9 apps with 457 APIs, simulating ~100 users.
  • Supports interactive coding via world.execute() with state persistence.
  • Integrates Model Context Protocol (MCP) for standardized tool access.
  • Includes code execution safety features (syntax checking, function patching).
  • Offers optional Docker deployment for API serving and isolation.
  • Provides detailed evaluation metrics (TGC, SGC) and per-task reports.
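The evaluation metrics in the list above can be sketched as follows. The summary does not expand the acronyms, so this assumes TGC means Task Goal Completion (the percentage of tasks whose evaluation tests all pass) and SGC means Scenario Goal Completion (the percentage of scenarios in which every task passes). The function name goal_completion and the record layout are hypothetical, and the example data is made up for illustration.

```python
from collections import defaultdict


def goal_completion(results):
    """Compute (TGC, SGC) percentages from (scenario_id, task_id, passed) records."""
    if not results:
        return 0.0, 0.0
    # Group task outcomes by the scenario they belong to.
    by_scenario = defaultdict(list)
    for scenario_id, _task_id, passed in results:
        by_scenario[scenario_id].append(passed)
    tasks_passed = sum(1 for _, _, passed in results if passed)
    # A scenario counts only if every one of its tasks passed.
    scenarios_passed = sum(1 for outcomes in by_scenario.values() if all(outcomes))
    tgc = 100.0 * tasks_passed / len(results)
    sgc = 100.0 * scenarios_passed / len(by_scenario)
    return tgc, sgc


# Hypothetical results: scenario "s1" passes fully, "s2" fails one of its tasks.
example = [
    ("s1", "s1_1", True),
    ("s1", "s1_2", True),
    ("s2", "s2_1", True),
    ("s2", "s2_2", False),
]
print(goal_completion(example))  # → (75.0, 50.0)
```

Under these assumptions SGC is the stricter number: a single failed task lowers TGC slightly but removes its whole scenario from the SGC count.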

Maintenance & Community

Hosted on GitHub, the project appears actively maintained with clear channels for feedback and contributions via issues. Specific community links (e.g., Discord, Slack) are not explicitly mentioned.

Licensing & Compatibility

Public components are licensed under Apache 2.0. Protected portions (app/task specifics) are also Apache 2.0, with the added requirement that any publicly redistributed derivatives remain encrypted. LLM training is permitted.

Limitations & Caveats

Key app/task data is stored in encrypted .bundle files, so it cannot be inspected directly on GitHub. To prevent leakage, the test sets provide only evaluation programs. The README cautions against posting extracted .bundle content online. While safety features are robust, Docker is recommended for maximum isolation. Realistic state reversion is not supported.

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 28 stars in the last 30 days
