SQL benchmark for diagnosing/solving user issues in real-world databases
BIRD-CRITIC 1.0 is a comprehensive SQL benchmark designed to assess the ability of Large Language Models (LLMs) to diagnose and resolve user-reported issues in real-world database applications. It targets researchers and developers working on LLM-powered database tools, offering a rigorous evaluation framework for SQL problem-solving capabilities across multiple dialects.
How It Works
The benchmark comprises 800 tasks (600 for development, 200 out-of-distribution) covering MySQL, PostgreSQL, SQL Server, and Oracle. It moves beyond simple SELECT queries to include CRUD operations and efficiency tuning, reflecting practical database challenges. Each task is human-verified for reproducibility and includes specific evaluation metrics: Soft EX (SELECT-only), Soft EX + Parsing (user-defined refinements), Test Case (logic correctness for CRUD/multi-query), and Query Execution Plan (efficiency/runtime error analysis). An optimized execution-based evaluation environment using Docker and PostgreSQL templates ensures efficient and consistent validation.
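To make the execution-based metrics concrete, below is a minimal sketch of a Soft EX-style check that treats two SELECT queries as equivalent when they return the same multiset of rows. This is an illustration only, not the benchmark's actual implementation: the helper name `soft_ex_match` is hypothetical, and SQLite stands in for the real dialect engines.

```python
# Illustrative Soft EX-style check: two SELECT queries "match" if they
# return the same rows, ignoring row order. Hypothetical helper; SQLite
# is used here only as a stand-in engine.
from collections import Counter
import sqlite3

def soft_ex_match(conn, predicted_sql: str, gold_sql: str) -> bool:
    """Return True if both queries yield the same multiset of rows."""
    pred_rows = conn.execute(predicted_sql).fetchall()
    gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(pred_rows) == Counter(gold_rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,)])
# Row order differs, but the multisets of rows are equal -> True
print(soft_ex_match(conn, "SELECT a FROM t", "SELECT a FROM t ORDER BY a DESC"))
```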
Quick Start & Requirements
Use `datasets.load_dataset` from HuggingFace for the flash (`birdsql/bird-critic-1.0-flash-exp`) or open (`birdsql/bird-critic-1.0-open`) version. Alternatively, use `pull_data.py` for the open version.
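A minimal loading sketch, assuming the `datasets` library is installed; split names may differ, so check the dataset cards on HuggingFace:

```python
# Minimal sketch: pull both BIRD-CRITIC variants from HuggingFace.
# Requires: pip install datasets
from datasets import load_dataset

flash = load_dataset("birdsql/bird-critic-1.0-flash-exp")  # PostgreSQL-focused flash set
full = load_dataset("birdsql/bird-critic-1.0-open")        # open multi-dialect set

print(flash)  # inspect available splits and task fields
```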
Running the evaluation requires the `requirements.txt` dependencies, LLM API keys (configured in `config.py`), and database dumps for PostgreSQL, MySQL, SQL Server, and Oracle.

Highlighted Details
A lightweight flash version (`bird-critic-1.0-flash-exp`) focused on PostgreSQL is available for quicker iteration.

Maintenance & Community
Contact bird.bench23@gmail.com or bird.bench25@gmail.com for full dataset access.

Licensing & Compatibility
Limitations & Caveats