TravelPlanner  by OSU-NLP-Group

Planning benchmark for real-world language agents

created 1 year ago
394 stars

Top 74.2% on sourcepulse

Project Summary

TravelPlanner is a benchmark designed to evaluate language agents on complex, multi-constraint planning tasks in real-world scenarios, such as itinerary creation. It targets researchers and developers building AI agents capable of tool use and sophisticated planning, offering a standardized way to measure performance against environmental, commonsense, and hard constraints.

How It Works

The benchmark supports two modes. In "Two-stage" mode, agents first use search tools to gather information and then produce a plan. In "Sole-planning" mode, all necessary information is provided up front, so only planning ability is measured. Plans are generated in natural language and then parsed into structured JSON with GPT-4 for evaluation. Together, the two modes assess both information retrieval and logical sequencing in agent design.
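As a rough illustration of why plans are parsed into structured JSON before evaluation, the sketch below checks a parsed plan for missing per-day fields. The field names in `REQUIRED_FIELDS` and the example plan are hypothetical, chosen only to mirror the categories named on this page; the benchmark's actual schema may differ.

```python
import json

# Hypothetical field names for illustration; not the benchmark's real schema.
REQUIRED_FIELDS = ("day", "transportation", "meals", "attractions", "accommodation")


def missing_fields(plan_json: str) -> dict[int, list[str]]:
    """Map each day's list index to the required fields absent from that day."""
    days = json.loads(plan_json)
    problems = {}
    for index, day in enumerate(days):
        absent = [field for field in REQUIRED_FIELDS if field not in day]
        if absent:
            problems[index] = absent
    return problems


# A made-up two-day plan; the second day omits two fields.
example_plan = json.dumps([
    {
        "day": 1,
        "transportation": "Flight F123",
        "meals": ["Cafe A"],
        "attractions": ["Museum B"],
        "accommodation": "Hotel C",
    },
    {"day": 2, "transportation": "Self-driving", "meals": ["Diner D"]},
])

print(missing_fields(example_plan))  # {1: ['attractions', 'accommodation']}
```

Once a plan is in a structured form like this, checks can be written as ordinary functions over fields rather than over free text.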

Quick Start & Requirements

  • Install: Create and activate a conda environment (conda create -n travelplanner python=3.9, then conda activate travelplanner), then install dependencies (pip install -r requirements.txt).
  • Data: Download the database and unzip it into the TravelPlanner directory.
  • API Keys: Requires an OpenAI API key; a Google API key is optional.
  • Models: Supports gpt-3.5-turbo-X, gpt-4-1106-preview, gemini, mistral-7B-32K, mixtral.
  • Resources: Fine-tuned models (Llama3.1-8B-Instruct, Qwen2-7B-Instruct) are available on HuggingFace.
  • Links: Website, Paper, Dataset, Leaderboard.
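Before running anything, it can help to confirm that the API keys are visible to the process. The following is a minimal sketch, assuming the keys are supplied through environment variables named `OPENAI_API_KEY` and `GOOGLE_API_KEY`; those names and the helper `check_api_keys` are assumptions for illustration, not something this page confirms.

```python
import os

# Assumed environment variable names; check the project's README for the real ones.
REQUIRED_KEYS = ("OPENAI_API_KEY",)
OPTIONAL_KEYS = ("GOOGLE_API_KEY",)


def check_api_keys(environ=os.environ):
    """Return (missing_required, missing_optional) lists of variable names."""
    missing_required = [name for name in REQUIRED_KEYS if not environ.get(name)]
    missing_optional = [name for name in OPTIONAL_KEYS if not environ.get(name)]
    return missing_required, missing_optional


if __name__ == "__main__":
    required, optional = check_api_keys()
    if required:
        print("Missing required keys:", ", ".join(required))
    if optional:
        print("Optional keys not set:", ", ".join(optional))
```

Passing the environment as a parameter keeps the check easy to exercise with a plain dictionary.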

Highlighted Details

  • Evaluates agents on planning with transportation, meals, attractions, and accommodation.
  • Incorporates Environment, Commonsense, and Hard constraints for realistic scenarios.
  • Provides a format check tool for test set submissions.
  • Offers fine-tuned models (Llama3.1-8B-Instruct, Qwen2-7B-Instruct) demonstrating improved performance on the benchmark.
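To make the idea of multi-constraint evaluation concrete, here is a small sketch in which a plan passes only if every check holds. The two checks (a total budget limit standing in for a hard constraint, and no repeated restaurants standing in for a commonsense constraint) and all names are illustrative assumptions; this is not the benchmark's evaluation code.

```python
from dataclasses import dataclass


@dataclass
class DayPlan:
    """Hypothetical per-day record used only for this illustration."""
    restaurants: list[str]
    cost: float


def within_budget(days: list[DayPlan], budget: float) -> bool:
    """Hard-constraint stand-in: total cost must not exceed the budget."""
    return sum(day.cost for day in days) <= budget


def no_repeated_restaurants(days: list[DayPlan]) -> bool:
    """Commonsense stand-in: no restaurant is visited more than once."""
    seen = set()
    for day in days:
        for name in day.restaurants:
            if name in seen:
                return False
            seen.add(name)
    return True


def passes_all(days: list[DayPlan], budget: float) -> bool:
    """A plan passes only when every individual check passes."""
    return within_budget(days, budget) and no_repeated_restaurants(days)


trip = [
    DayPlan(restaurants=["Cafe A", "Diner B"], cost=420.0),
    DayPlan(restaurants=["Cafe A"], cost=380.0),
]
print(within_budget(trip, 1000.0))            # True
print(no_repeated_restaurants(trip))          # False
print(passes_all(trip, 1000.0))               # False
```

The example plan stays under budget yet still fails overall, which shows why satisfying several constraint types at once is harder than satisfying any one of them.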

Maintenance & Community

The project is associated with the ICML'24 conference. Updates are posted regularly, including model releases and format check tools. Contact information for Jian Xie, Kai Zhang, and Yu Su is provided.

Licensing & Compatibility

The repository does not explicitly state a license. The README permits extending and editing the database for new tasks provided licensing terms are followed, which implies that a license applies even though none is specified.

Limitations & Caveats

The benchmark strictly prohibits reverse engineering, hard-coding evaluation cues, and other human-interference strategies that do not generalize. Violations may lead to disqualification. Because no license is explicitly stated, commercial use may raise compatibility concerns.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 43 stars in the last 90 days
