Benchmark for evaluating LLMs on planning tasks
This repository provides an extensible benchmark for evaluating the planning and reasoning capabilities of Large Language Models (LLMs) and Language Reasoning Models (LRMs). It targets researchers and developers working on AI planning, LLM evaluation, and agent development, offering a standardized way to assess model performance across various planning domains and prompting strategies.
How It Works
The benchmark combines the codebases from three key papers and focuses on evaluating LLMs' ability to generate valid plans from natural-language descriptions or PDDL. It includes static test sets for Blocksworld and Mystery Blocksworld, with zero-shot prompting results reported for both. Evaluation measures a model's success rate: the fraction of problem instances for which it produces a correct plan.
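The README does not describe the repository's own evaluation scripts, so the following Python sketch is only a rough illustration of the zero-shot setup: it computes a success rate over a directory of PDDL problem instances. The `query_model` and `is_valid_plan` stubs, the file layout, and the prompt wording are assumptions for illustration, not the repository's actual interface.

```python
from pathlib import Path


def query_model(prompt: str) -> str:
    """Placeholder for an LLM call; swap in your provider's client here."""
    raise NotImplementedError


def is_valid_plan(domain_pddl: str, problem_pddl: str, plan: str) -> bool:
    """Placeholder for a PDDL plan validator (e.g. VAL); True if the plan solves the problem."""
    raise NotImplementedError


def zero_shot_success_rate(instance_dir: Path, domain_file: Path) -> float:
    """Prompt the model once per problem instance and return the fraction of valid plans."""
    domain = domain_file.read_text()
    instances = sorted(instance_dir.glob("*.pddl"))
    solved = 0
    for problem_path in instances:
        problem = problem_path.read_text()
        prompt = (
            "You are a planner. Produce a sequence of actions that solves the "
            f"following problem.\n\nDomain:\n{domain}\n\nProblem:\n{problem}\n\nPlan:"
        )
        plan = query_model(prompt)
        if is_valid_plan(domain, problem, plan):
            solved += 1
    return solved / len(instances) if instances else 0.0
```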
Quick Start & Requirements
Highlighted Details
Backprompting results are stored under llm_planning_analysis/results/backprompting/.
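The format of the files in that directory is not documented in the README. The sketch below is a hypothetical way to browse them, assuming JSON result files; the glob pattern and record keys are guesses and may need adjusting to match the actual layout.

```python
import json
from pathlib import Path

# Hypothetical walk over the backprompting results directory.
results_dir = Path("llm_planning_analysis/results/backprompting")
for result_file in sorted(results_dir.glob("**/*.json")):
    with result_file.open() as f:
        record = json.load(f)
    # Print whichever fields the files actually contain; these keys are guesses.
    if isinstance(record, dict):
        print(result_file.name, record.get("instance_id"), record.get("success"))
    else:
        print(result_file.name, type(record).__name__)
```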
Maintenance & Community
The most recent update was about 1 month ago, and the repository is currently marked inactive.
Licensing & Compatibility
Limitations & Caveats
The README does not provide explicit installation or execution instructions, requiring users to infer setup from the associated papers. License information is absent, potentially impacting commercial adoption.