LLMs-Planning by karthikv792

Benchmark for evaluating LLMs on planning tasks

created 3 years ago
393 stars

Top 74.3% on sourcepulse

View on GitHub
Project Summary

This repository provides an extensible benchmark for evaluating the planning and reasoning capabilities of Large Language Models (LLMs) and Language Reasoning Models (LRMs). It targets researchers and developers working on AI planning, LLM evaluation, and agent development, offering a standardized way to assess model performance across various planning domains and prompting strategies.

How It Works

The benchmark comprises the codebases of three key papers and focuses on evaluating LLMs' ability to generate valid plans from natural-language descriptions or PDDL specifications. It includes static test sets for Blocksworld and Mystery Blocksworld, with performance reported for zero-shot prompting. Evaluation measures each model's success rate: the fraction of problem instances for which it produces a correct plan.
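The core check behind a success-rate metric of this kind is plan validation: simulate the candidate plan from the initial state and test whether the goal holds. The sketch below is an illustrative toy validator for Blocksworld, not the repository's actual evaluator (which relies on the papers' own tooling); all names here are assumptions for illustration.

```python
# Toy Blocksworld plan validator (illustrative sketch, not the repo's code).
# A state is {'on': {block: support}, 'holding': block_or_None}, where a
# support is another block or 'table'. A block is clear if nothing sits on it.

def clear(on, b):
    """A block is clear when no other block uses it as support."""
    return b not in on.values()

def apply(state, action):
    """Apply one action; return the successor state, or None if invalid."""
    on, holding = dict(state['on']), state['holding']
    kind = action[0]
    if kind == 'pickup':            # pick up a clear block from the table
        b = action[1]
        if holding is None and on.get(b) == 'table' and clear(on, b):
            del on[b]
            return {'on': on, 'holding': b}
    elif kind == 'putdown':         # put the held block on the table
        b = action[1]
        if holding == b:
            on[b] = 'table'
            return {'on': on, 'holding': None}
    elif kind == 'unstack':         # take a clear block off another block
        b, c = action[1], action[2]
        if holding is None and on.get(b) == c and clear(on, b):
            del on[b]
            return {'on': on, 'holding': b}
    elif kind == 'stack':           # place the held block on a clear block
        b, c = action[1], action[2]
        if holding == b and c in on and clear(on, c):
            on[b] = c
            return {'on': on, 'holding': None}
    return None                     # preconditions violated

def validate(init, plan, goal):
    """True iff every action is applicable in sequence and the goal holds."""
    state = init
    for action in plan:
        state = apply(state, action)
        if state is None:
            return False
    return all(state['on'].get(b) == s for b, s in goal.items())

# Example: swap a two-block tower (B on A) into (A on B).
init = {'on': {'A': 'table', 'B': 'A'}, 'holding': None}
goal = {'A': 'B'}
plan = [('unstack', 'B', 'A'), ('putdown', 'B'),
        ('pickup', 'A'), ('stack', 'A', 'B')]
print(validate(init, plan, goal))  # True
```

An LLM's zero-shot output would first have to be parsed into such an action sequence; the benchmark's headline numbers are the fraction of instances where this check passes.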

Quick Start & Requirements

  • Installation and execution details are not explicitly provided in the README.
  • The benchmark evaluates models using natural language and PDDL prompts.
  • Access to specific LLMs or LRMs is required for running evaluations.
  • Refer to the individual paper subdirectories for specific setup instructions.

Highlighted Details

  • Includes a leaderboard showcasing zero-shot performance of various LLMs (GPT-4o, Claude 3.5 Sonnet, LLaMA-3.1) and LRMs (Deepseek R1, o1) on Blocksworld and Mystery Blocksworld tasks.
  • Supports evaluation using both Natural Language (NL) and Planning Domain Definition Language (PDDL) prompting.
  • Features test sets for Blocksworld Hard, with results available in llm_planning_analysis/results/backprompting/.
  • Codebase is derived from three peer-reviewed publications in AI planning and LLM evaluation.
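The NL-versus-PDDL distinction in the bullets above can be illustrated with one toy instance expressed both ways. This is a hypothetical example for orientation only; the repository's actual prompt templates live in the individual paper subdirectories.

```python
# Hypothetical illustration of the two prompting styles the leaderboard
# compares (not taken from the repo's test sets).

# Natural-language (NL) prompting: the domain rules, initial state, and
# goal are described in plain English.
NL_PROMPT = (
    "I am playing with a set of blocks. I can pick up a clear block, "
    "put down a held block, stack a held block on a clear block, or "
    "unstack a clear block from another block.\n"
    "Initial state: block B is on block A, and block A is on the table.\n"
    "Goal: block A is on block B.\n"
    "Provide a plan as a sequence of actions."
)

# PDDL prompting: the same instance in the formal language that classical
# planners consume, given to the model verbatim.
PDDL_PROMPT = """\
(define (problem bw-example)
  (:domain blocksworld)
  (:objects a b)
  (:init (ontable a) (on b a) (clear b) (handempty))
  (:goal (on a b)))
"""

print(len(NL_PROMPT), len(PDDL_PROMPT))
```

Mystery Blocksworld applies the same idea with predicate and action names obfuscated, so the model cannot lean on surface familiarity with block-stacking vocabulary.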

Maintenance & Community

  • The project is associated with Karthik Valmeekam and Subbarao Kambhampati, prominent researchers in AI planning.
  • The leaderboard is open for submissions via pull requests.
  • Citations for the included papers are provided.

Licensing & Compatibility

  • The README does not specify a license.
  • Compatibility for commercial use or closed-source linking is not detailed.

Limitations & Caveats

The README does not provide explicit installation or execution instructions, requiring users to infer setup from the associated papers. License information is absent, potentially impacting commercial adoption.

Health Check
Last commit

1 month ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
42 stars in the last 90 days
