Planning benchmark for real-world language agents
TravelPlanner is a benchmark designed to evaluate language agents on complex, multi-constraint planning tasks in real-world scenarios, such as itinerary creation. It targets researchers and developers building AI agents capable of tool use and sophisticated planning, offering a standardized way to measure performance against environmental, commonsense, and hard constraints.
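As a rough illustration of how a hard constraint differs from a commonsense constraint, the sketch below checks a toy itinerary against a budget limit and a no-repeated-restaurant rule. The `DayPlan` fields, the function names, and both example rules are assumptions made for this illustration; they are not taken from the TravelPlanner evaluation code.

```python
from dataclasses import dataclass, field


@dataclass
class DayPlan:
    """Hypothetical per-day record; field names are illustrative only."""
    city: str
    restaurants: list[str] = field(default_factory=list)
    cost: float = 0.0


def within_budget(plan: list[DayPlan], budget: float) -> bool:
    """Hard constraint (assumed example): total cost must not exceed the budget."""
    return sum(day.cost for day in plan) <= budget


def no_repeated_restaurants(plan: list[DayPlan]) -> bool:
    """Commonsense constraint (assumed example): visit each restaurant at most once."""
    visited = [name for day in plan for name in day.restaurants]
    return len(visited) == len(set(visited))


# Toy two-day itinerary with made-up values.
plan = [
    DayPlan(city="Chicago", restaurants=["Cafe A", "Diner B"], cost=320.0),
    DayPlan(city="Chicago", restaurants=["Cafe A"], cost=150.0),
]
print(within_budget(plan, budget=500.0))  # total cost is 470.0
print(no_repeated_restaurants(plan))      # "Cafe A" appears on both days
```

Keeping each rule as a separate predicate makes it possible to report which constraint a plan violates, not only whether it passes overall.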
How It Works
The benchmark supports two modes: "Two-stage", in which agents use search tools to gather information before planning, and "Sole-planning", which isolates planning ability by supplying all necessary information up front. Plans are first generated in natural language and then parsed into structured JSON using GPT-4 for evaluation. Together, the two modes assess both information retrieval and logical sequencing in agent design.
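Evaluation of this kind depends on the parsed plan being well-formed JSON. A minimal format-check sketch follows, assuming a hypothetical schema in which the plan is a list of per-day objects with `day`, `current_city`, `transportation`, and `accommodation` keys. The real TravelPlanner schema may differ, and the GPT-4 parsing step is not reproduced here.

```python
import json

# Hypothetical required keys; the benchmark's real schema may differ.
REQUIRED_KEYS = {"day", "current_city", "transportation", "accommodation"}


def check_plan_format(raw: str) -> list[str]:
    """Return a list of problems found in a JSON-encoded plan (empty if none)."""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(plan, list):
        return ["top-level value must be a list of day entries"]

    problems = []
    for index, entry in enumerate(plan, start=1):
        if not isinstance(entry, dict):
            problems.append(f"entry {index}: expected an object")
            continue
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"entry {index}: missing keys {sorted(missing)}")
        elif entry["day"] != index:
            # Days are expected to be numbered consecutively from 1.
            problems.append(f"entry {index}: day is {entry['day']!r}, expected {index}")
    return problems


# Made-up plan whose second entry lacks an accommodation field.
raw_plan = json.dumps([
    {"day": 1, "current_city": "Austin", "transportation": "-", "accommodation": "Hotel X"},
    {"day": 3, "current_city": "Austin", "transportation": "-"},
])
for problem in check_plan_format(raw_plan):
    print(problem)
```

Collecting every problem instead of stopping at the first one gives a fuller picture of how far a generated plan is from the expected structure.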
Quick Start & Requirements
Create a conda environment (conda create -n travelplanner python=3.9, conda activate travelplanner) and install dependencies (pip install -r requirements.txt).
[...] TravelPlanner directory.
Models: gpt-3.5-turbo-X, gpt-4-1106-preview, gemini, mistral-7B-32K, mixtral.
Highlighted Details
Maintenance & Community
The project is associated with the ICML'24 conference. Updates are posted regularly, including model releases and format check tools. Contact information for Jian Xie, Kai Zhang, and Yu Su is provided.
Licensing & Compatibility
The repository does not explicitly state a license. The README says that extending and editing the database for new tasks is permitted provided the licensing terms are followed, which implies that a license applies but is not specified.
Limitations & Caveats
The benchmark strictly prohibits reverse engineering, hard-coding evaluation cues, and other human-interference strategies that do not generalize; violations may lead to disqualification. Because the README does not state a license, commercial use may raise compatibility issues.