Toolathlon by hkust-nlp

Benchmarking language agents for diverse, long-horizon tool use

Created 4 months ago
267 stars

Top 96.1% on SourcePulse

Project Summary

Summary

Toolathlon is an ICLR 2026 benchmark that evaluates language agents on diverse, realistic, long-horizon tasks. It targets robust assessment of general tool use, exposing over 600 real-world software tools inside containerized environments. It is aimed at researchers and engineers building AI agents that must carry out complex, multi-step software interactions.

How It Works

The benchmark runs each task in a container (Docker or Podman), which keeps task environments isolated and reproducible. Agents interact with over 600 real-world software tools, and completing a task demands long-horizon planning. Toolathlon provides a unified model provider for integrating LLM APIs, and its agent loop is decoupled from the rest of the benchmark so that different agent scaffolds can be plugged in.
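To make the idea of a decoupled agent loop concrete, here is a minimal Python sketch. It is not Toolathlon's code or API: the names ToolCall, run_loop, and scripted_policy are hypothetical, and the "tools" are plain Python callables standing in for real software tools. The point is only that the policy (what to do next) and the executor (how a tool call is run) share a narrow interface, so either side can be swapped.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ToolCall:
    """Hypothetical record of one requested tool invocation."""
    name: str
    args: dict


# A policy looks at the history so far and returns the next call, or None to stop.
Policy = Callable[[List[dict]], Optional[ToolCall]]
Tool = Callable[..., str]


def run_loop(policy: Policy, tools: Dict[str, Tool], max_steps: int = 20) -> List[dict]:
    """Drive any policy against any tool set; returns the recorded trajectory."""
    history: List[dict] = []
    for _ in range(max_steps):
        call = policy(history)
        if call is None:  # the policy decided the task is finished
            break
        tool = tools.get(call.name)
        result = tool(**call.args) if tool else f"unknown tool: {call.name}"
        history.append({"tool": call.name, "args": call.args, "result": result})
    return history


def scripted_policy(history: List[dict]) -> Optional[ToolCall]:
    """Toy stand-in for an LLM-backed policy: one call, then stop."""
    if not history:
        return ToolCall("add", {"a": 2, "b": 3})
    return None


tools = {"add": lambda a, b: str(a + b)}
trajectory = run_loop(scripted_policy, tools)
print(trajectory)
```

In a real scaffold the policy would wrap an LLM call and the tool table would wrap the containerized tools; the loop itself would not need to change.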

Quick Start & Requirements

  • Installation: Requires uv. Once Docker or Podman is installed, core setup consists of running bash global_preparation/install_env_minimal.sh and bash global_preparation/pull_toolathlon_image.sh. LLM API keys and base URLs must be configured via environment variables or in configs/global_configs.py.
  • Prerequisites: Docker/Podman, Linux recommended, internet access. Users must register accounts and configure tokens/keys per how2register_accounts.md.
  • Setup Time/Footprint: Local setup is involved (container runtime, account registration, API keys). A public evaluation service is available for those who want to start without a local install.
  • Links: Website, Discord, arXiv, Hugging Face Datasets, GitHub.
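Before running the setup scripts listed above, it can help to confirm that the required executables are on PATH. The following is a small, hypothetical preflight check, not something shipped with Toolathlon: the function name missing_prerequisites is an assumption, and it only looks for the executables uv, docker, and podman using the standard library's shutil.which.

```python
import shutil
from typing import Callable, List, Optional


def missing_prerequisites(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the prerequisites that cannot be found on PATH (hypothetical helper)."""
    missing: List[str] = []
    if which("uv") is None:
        missing.append("uv")
    # Either container runtime is sufficient.
    if which("docker") is None and which("podman") is None:
        missing.append("docker or podman")
    return missing


if __name__ == "__main__":
    problems = missing_prerequisites()
    if problems:
        print("Missing:", ", ".join(problems))
    else:
        print("Prerequisites found. Setup commands named in this summary:")
        print("  bash global_preparation/install_env_minimal.sh")
        print("  bash global_preparation/pull_toolathlon_image.sh")
```

The check does not verify account registration or API keys, which still have to be configured as described in how2register_accounts.md.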

Highlighted Details

  • Features over 600 diverse tools from real-world software environments.
  • Tasks require long sequences of tool calls to complete.
  • Offers a ready-to-use public evaluation service.
  • Supports decoupled agent loops for integration with various agent frameworks.
  • Includes a visualization tool (vis_traj) for analyzing agent reasoning trajectories.

Maintenance & Community

An active community is maintained via Discord, with regular updates on model trajectories and documentation. Support is available via email and GitHub issues.

Licensing & Compatibility

The project's license is not explicitly stated in the README. This omission hinders assessment of commercial use or closed-source linking compatibility.

Limitations & Caveats

Local setup demands substantial configuration, including Docker/Podman and account/API key registration. Some advanced configurations may require sudo privileges. The absence of a clear license is a significant adoption blocker.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 2

Star History

  • 49 stars in the last 30 days
