ParseBench by run-llama

AI agent document parsing benchmark

Created 3 months ago

520 stars

Top 59.7% on SourcePulse


2 Experts Love This Project

Jerry Liu

Cofounder of LlamaIndex

Simon Suo

Cofounder of LlamaIndex

Project Summary

ParseBench is a benchmark that measures how well document parsing tools convert PDFs into structured data that AI agents can use. Traditional parsing evaluations focus on visual fidelity; ParseBench instead targets the semantic and structural integrity that autonomous decision-making depends on. It gives engineers and researchers building document-reliant AI agents a standardized way to compare parsing solutions and choose among them.

How It Works

The benchmark evaluates parsing tools across five capability dimensions: Tables, Charts, Content Faithfulness, Semantic Formatting, and Visual Grounding. Each dimension targets failure modes that commonly disrupt AI agent workflows, such as misinterpreting table headers, extracting incorrect chart data points, or omitting and hallucinating content. Because its metrics are deterministic and rule-based, ParseBench produces objective scores that reflect how useful a parsed document is for downstream AI tasks, rather than relying on subjective assessment.
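To make "deterministic, rule-based" concrete, here is a minimal sketch of what such a check could look like for content faithfulness. It is an illustration only and not ParseBench's actual metric: the function names, the rule format (phrases that must or must not appear), and the sample text are all assumptions made for this example.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout noise does not affect matching."""
    return re.sub(r"\s+", " ", text).strip().lower()


def faithfulness_score(
    parsed: str,
    must_contain: list[str],
    must_not_contain: list[str],
) -> float:
    """Hypothetical rule-based score: the fraction of rules the parsed text satisfies.

    must_contain     -> phrases from the source page (a miss suggests an omission)
    must_not_contain -> phrases absent from the source page (a hit suggests a hallucination)
    """
    haystack = normalize(parsed)
    checks = [normalize(phrase) in haystack for phrase in must_contain]
    checks += [normalize(phrase) not in haystack for phrase in must_not_contain]
    # With no rules there is nothing to fail, so treat the page as fully faithful.
    return sum(checks) / len(checks) if checks else 1.0


# Made-up parser output for a made-up page.
parsed_page = "Total  Premium: $1,200\nDeductible: $500"
score = faithfulness_score(
    parsed_page,
    must_contain=["Total premium: $1,200", "Deductible: $500"],
    must_not_contain=["$5,000"],
)
print(score)
```

The same input always yields the same score, which is the property that separates this style of evaluation from an LLM-as-a-judge setup.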

Quick Start & Requirements

Install and run:
- Install dependencies: uv sync --extra runners
- Run an evaluation: uv run parse-bench run <pipeline_name>
- Quick test: uv run parse-bench run llamaparse_agentic --test
Prerequisites: a .env file at the project root containing the API keys for the parsing tool being evaluated (e.g., LLAMA_CLOUD_API_KEY, OPENAI_API_KEY).
Links:
- Leaderboard: parsebench.ai
- Dataset: llamaindex/ParseBench on HuggingFace
- Paper: arXiv:2604.08538
- Available Pipelines: docs/pipelines.md
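Since a run depends on API keys in the .env file, a small preflight check can catch a missing key before any evaluation starts. The sketch below is not part of ParseBench; the helper names are hypothetical, it handles only plain KEY=VALUE lines, and which keys a given pipeline actually needs is an assumption you would confirm against docs/pipelines.md.

```python
def present_keys(env_text: str) -> set[str]:
    """Return the names of variables that have a non-empty value in KEY=VALUE text."""
    keys = set()
    for raw in env_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        # Treat empty or empty-quoted values as unset.
        if value.strip().strip("\"'"):
            keys.add(name.strip())
    return keys


def missing_keys(env_text: str, required: list[str]) -> list[str]:
    """List the required keys that are absent or empty, preserving the given order."""
    present = present_keys(env_text)
    return [key for key in required if key not in present]


# Made-up .env contents; the key names are the two examples mentioned above.
sample_env = """\
# keys for the pipeline under test
LLAMA_CLOUD_API_KEY=llx-example
OPENAI_API_KEY=
"""
print(missing_keys(sample_env, ["LLAMA_CLOUD_API_KEY", "OPENAI_API_KEY"]))
# prints ['OPENAI_API_KEY']
```

To check a real file, pass the contents of the project's .env (for example, read with pathlib.Path(".env").read_text()) in place of sample_env.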

Highlighted Details

Features a dataset of approximately 2,000 human-verified pages drawn from real-world enterprise documents in the finance and insurance sectors.
Evaluates on five key dimensions: Tables, Charts, Content Faithfulness, Semantic Formatting, and Visual Grounding, each with specific metrics and ground truth formats.
Provides a public leaderboard showcasing performance scores and cost-per-page metrics for various parsing pipelines, with LlamaParse Agentic leading at 84.88 overall.
Supports over 90 pre-configured pipelines for evaluating diverse parsing tools and configurations.

Maintenance & Community

No specific details regarding maintainers, community channels (like Discord or Slack), or ongoing development signals were present in the provided README.

Licensing & Compatibility

The license type and any compatibility notes for commercial use or closed-source linking were not specified in the provided README.

Limitations & Caveats

The benchmark focuses on the functional capabilities most critical for AI agents, so other aspects of document parsing may not be covered. Evaluation is deterministic and rule-based and does not use an LLM as a judge. Running evaluations requires obtaining and configuring API keys for the tools under test.

Health Check

Last Commit

2 days ago

Responsiveness

Inactive

Star History

29 stars in the last 30 days