LLMs-Planning by karthikv792

Benchmark for evaluating LLMs on planning tasks

created 3 years ago
393 stars

Top 74.3% on sourcepulse

View on GitHub
Project Summary

This repository provides an extensible benchmark for evaluating the planning and reasoning capabilities of Large Language Models (LLMs) and Language Reasoning Models (LRMs). It targets researchers and developers working on AI planning, LLM evaluation, and agent development, offering a standardized way to assess model performance across various planning domains and prompting strategies.

How It Works

The benchmark comprises the codebases of three key papers and focuses on evaluating LLMs' ability to generate valid plans from natural-language descriptions or PDDL specifications. It includes static test sets for Blocksworld and Mystery Blocksworld, with performance reported for zero-shot prompting. Evaluation measures each model's success rate: the fraction of problem instances for which it produces a correct plan.
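The core check behind a success-rate metric of this kind is plan validation: simulate the candidate plan from the initial state and test whether the goal holds. The sketch below is an illustrative toy validator for Blocksworld, not the repository's actual evaluator (which relies on the papers' own tooling); all names here are assumptions for illustration.

```python
# Toy Blocksworld plan validator (illustrative sketch, not the repo's code).
# A state is {'on': {block: support}, 'holding': block_or_None}, where a
# support is another block or 'table'. A block is clear if nothing sits on it.

def clear(on, b):
    """A block is clear when no other block uses it as support."""
    return b not in on.values()

def apply(state, action):
    """Apply one action; return the successor state, or None if invalid."""
    on, holding = dict(state['on']), state['holding']
    kind = action[0]
    if kind == 'pickup':            # pick up a clear block from the table
        b = action[1]
        if holding is None and on.get(b) == 'table' and clear(on, b):
            del on[b]
            return {'on': on, 'holding': b}
    elif kind == 'putdown':         # put the held block on the table
        b = action[1]
        if holding == b:
            on[b] = 'table'
            return {'on': on, 'holding': None}
    elif kind == 'unstack':         # take a clear block off another block
        b, c = action[1], action[2]
        if holding is None and on.get(b) == c and clear(on, b):
            del on[b]
            return {'on': on, 'holding': b}
    elif kind == 'stack':           # place the held block on a clear block
        b, c = action[1], action[2]
        if holding == b and c in on and clear(on, c):
            on[b] = c
            return {'on': on, 'holding': None}
    return None                     # preconditions violated

def validate(init, plan, goal):
    """True iff every action is applicable in sequence and the goal holds."""
    state = init
    for action in plan:
        state = apply(state, action)
        if state is None:
            return False
    return all(state['on'].get(b) == s for b, s in goal.items())

# Example: swap a two-block tower (B on A) into (A on B).
init = {'on': {'A': 'table', 'B': 'A'}, 'holding': None}
goal = {'A': 'B'}
plan = [('unstack', 'B', 'A'), ('putdown', 'B'),
        ('pickup', 'A'), ('stack', 'A', 'B')]
print(validate(init, plan, goal))  # True
```

An LLM's zero-shot output would first have to be parsed into such an action sequence; the benchmark's headline numbers are the fraction of instances where this check passes.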

Quick Start & Requirements

  • Installation and execution details are not explicitly provided in the README.
  • The benchmark evaluates models using natural language and PDDL prompts.
  • Access to specific LLMs or LRMs is required for running evaluations.
  • Refer to the individual paper subdirectories for specific setup instructions.

Highlighted Details

  • Includes a leaderboard showcasing zero-shot performance of various LLMs (GPT-4o, Claude 3.5 Sonnet, LLaMA-3.1) and LRMs (Deepseek R1, o1) on Blocksworld and Mystery Blocksworld tasks.
  • Supports evaluation using both Natural Language (NL) and Planning Domain Definition Language (PDDL) prompting.
  • Features test sets for Blocksworld Hard, with results available in llm_planning_analysis/results/backprompting/.
  • Codebase is derived from three peer-reviewed publications in AI planning and LLM evaluation.
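The NL-versus-PDDL distinction in the bullets above can be illustrated with one toy instance expressed both ways. This is a hypothetical example for orientation only; the repository's actual prompt templates live in the individual paper subdirectories.

```python
# Hypothetical illustration of the two prompting styles the leaderboard
# compares (not taken from the repo's test sets).

# Natural-language (NL) prompting: the domain rules, initial state, and
# goal are described in plain English.
NL_PROMPT = (
    "I am playing with a set of blocks. I can pick up a clear block, "
    "put down a held block, stack a held block on a clear block, or "
    "unstack a clear block from another block.\n"
    "Initial state: block B is on block A, and block A is on the table.\n"
    "Goal: block A is on block B.\n"
    "Provide a plan as a sequence of actions."
)

# PDDL prompting: the same instance in the formal language that classical
# planners consume, given to the model verbatim.
PDDL_PROMPT = """\
(define (problem bw-example)
  (:domain blocksworld)
  (:objects a b)
  (:init (ontable a) (on b a) (clear b) (handempty))
  (:goal (on a b)))
"""

print(len(NL_PROMPT), len(PDDL_PROMPT))
```

Mystery Blocksworld applies the same idea with predicate and action names obfuscated, so the model cannot lean on surface familiarity with block-stacking vocabulary.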

Maintenance & Community

  • The project is associated with Karthik Valmeekam and Subbarao Kambhampati, prominent researchers in AI planning.
  • The leaderboard is open for submissions via pull requests.
  • Citations for the included papers are provided.

Licensing & Compatibility

  • The README does not specify a license.
  • Compatibility for commercial use or closed-source linking is not detailed.

Limitations & Caveats

The README does not provide explicit installation or execution instructions, requiring users to infer setup from the associated papers. License information is absent, potentially impacting commercial adoption.

Health Check
Last commit

1 month ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
42 stars in the last 90 days
