TravelPlanner by OSU-NLP-Group

Planning benchmark for real-world language agents

Created 2 years ago

475 stars

Top 64.4% on SourcePulse

Project Summary

TravelPlanner is a benchmark designed to evaluate language agents on complex, multi-constraint planning tasks in real-world scenarios, such as itinerary creation. It targets researchers and developers building AI agents capable of tool use and sophisticated planning, offering a standardized way to measure performance against environmental, commonsense, and hard constraints.

How It Works

The benchmark supports two modes. In "Two-stage" mode, agents first use search tools to gather information and then plan. In "Sole-planning" mode, agents receive all necessary information up front, so only planning ability is measured. Plans are generated in natural language and then parsed into structured JSON using GPT-4 for evaluation. Together, the two modes separate information retrieval from logical sequencing, so each capability can be assessed on its own.
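As a rough illustration of why a structured plan is easier to evaluate than free text, the sketch below models a parsed itinerary and checks one hard constraint (a budget cap). The `DayPlan` schema, its field names, and the sample values are hypothetical and chosen only for this example; they are not the benchmark's actual JSON format or evaluation code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DayPlan:
    """One day of a parsed itinerary (hypothetical schema for illustration)."""
    day: int
    city: str
    transportation_cost: float
    meal_cost: float
    accommodation_cost: float


def total_cost(plan: List[DayPlan]) -> float:
    """Sum every cost field across all days of the plan."""
    return sum(
        d.transportation_cost + d.meal_cost + d.accommodation_cost
        for d in plan
    )


def within_budget(plan: List[DayPlan], budget: float) -> bool:
    """Hard-constraint check: the whole trip must not exceed the budget."""
    return total_cost(plan) <= budget


# Made-up sample data, not taken from the benchmark database.
example_plan = [
    DayPlan(day=1, city="Columbus", transportation_cost=120.0,
            meal_cost=45.0, accommodation_cost=90.0),
    DayPlan(day=2, city="Chicago", transportation_cost=60.0,
            meal_cost=50.0, accommodation_cost=110.0),
]

print(total_cost(example_plan))
print(within_budget(example_plan, budget=500.0))
```

Once a plan is in this kind of structured form, each constraint becomes a small function over the data rather than a judgment call on prose, which is presumably the motivation for the parsing step.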

Quick Start & Requirements

Install: Create and activate a conda environment (conda create -n travelplanner python=3.9, then conda activate travelplanner), then install dependencies (pip install -r requirements.txt).
Data: Download the database and unzip it into the TravelPlanner directory.
API Keys: Requires an OpenAI API key; a Google API key is optional.
Models: Supports gpt-3.5-turbo-X, gpt-4-1106-preview, gemini, mistral-7B-32K, mixtral.
Resources: Fine-tuned models (Llama3.1-8B-Instruct, Qwen2-7B-Instruct) are available on HuggingFace.
Links: Website, Paper, Dataset, Leaderboard.
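Before running anything, it can help to confirm that the API keys are visible to the process. The following is a minimal sketch, assuming the keys are supplied through environment variables named `OPENAI_API_KEY` and `GOOGLE_API_KEY`; those variable names and the `check_api_keys` helper are assumptions for this example, not something this summary confirms.

```python
import os
from typing import Dict, List, Mapping

# Assumed environment variable names; adjust to whatever the repository expects.
REQUIRED_KEYS = ["OPENAI_API_KEY"]
OPTIONAL_KEYS = ["GOOGLE_API_KEY"]


def check_api_keys(env: Mapping[str, str]) -> Dict[str, List[str]]:
    """Report which required and optional keys are unset or empty."""
    return {
        "missing_required": [k for k in REQUIRED_KEYS if not env.get(k)],
        "missing_optional": [k for k in OPTIONAL_KEYS if not env.get(k)],
    }


report = check_api_keys(os.environ)
if report["missing_required"]:
    print("Missing required keys:", ", ".join(report["missing_required"]))
if report["missing_optional"]:
    print("Optional keys not set:", ", ".join(report["missing_optional"]))
```

Passing the environment in as a mapping, rather than reading `os.environ` inside the function, keeps the check easy to test with a plain dictionary.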

Highlighted Details

Evaluates agents on planning with transportation, meals, attractions, and accommodation.
Incorporates Environment, Commonsense, and Hard constraints for realistic scenarios.
Provides a format check tool for test set submissions.
Offers fine-tuned models (Llama3.1-8B-Instruct, Qwen2-7B-Instruct) demonstrating improved performance on the benchmark.
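To give a sense of what a submission format check might involve, here is a small sketch that validates a JSON Lines file before upload. It is not the project's format check tool: the one-object-per-line layout and the field names `idx`, `query`, and `plan` are hypothetical, chosen only to make the example concrete.

```python
import json
from typing import List

# Hypothetical required fields; the real submission schema may differ.
REQUIRED_FIELDS = ("idx", "query", "plan")


def find_format_errors(jsonl_text: str) -> List[str]:
    """Return a list of human-readable problems found in JSON Lines text."""
    errors: List[str] = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            errors.append(f"line {line_no}: not valid JSON")
            continue
        if not isinstance(record, dict):
            errors.append(f"line {line_no}: expected a JSON object")
            continue
        for field in REQUIRED_FIELDS:
            if field not in record:
                errors.append(f"line {line_no}: missing field '{field}'")
    return errors


sample = '{"idx": 1, "query": "3-day trip", "plan": []}\n{"idx": 2}'
for problem in find_format_errors(sample):
    print(problem)
```

Collecting every problem instead of stopping at the first one lets a submitter fix a file in a single pass.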

Maintenance & Community

The project is associated with the ICML'24 conference. Updates are posted regularly, including model releases and format check tools. Contact information for Jian Xie, Kai Zhang, and Yu Su is provided.

Licensing & Compatibility

The repository does not explicitly state a license. The README mentions that extending and editing the database for new tasks is permitted provided adherence to licensing terms, implying a license is in place but not specified.

Limitations & Caveats

The benchmark strictly prohibits reverse engineering, hard-coding evaluation cues, and other human-interference strategies that do not generalize; violations may lead to disqualification. Because no license is stated, commercial use may raise compatibility questions.

