SQL benchmark for diagnosing/solving user issues in real-world databases
BIRD-CRITIC 1.0 is a comprehensive SQL benchmark designed to assess the ability of Large Language Models (LLMs) to diagnose and resolve user-reported issues in real-world database applications. It targets researchers and developers working on LLM-powered database tools, offering a rigorous evaluation framework for SQL problem-solving capabilities across multiple dialects.
How It Works
The benchmark comprises 800 tasks (600 for development, 200 out-of-distribution) covering MySQL, PostgreSQL, SQL Server, and Oracle. It moves beyond simple SELECT queries to include CRUD operations and efficiency tuning, reflecting practical database challenges. Each task is human-verified for reproducibility and includes specific evaluation metrics: Soft EX (SELECT-only), Soft EX + Parsing (user-defined refinements), Test Case (logic correctness for CRUD/multi-query), and Query Execution Plan (efficiency/runtime error analysis). An optimized execution-based evaluation environment using Docker and PostgreSQL templates ensures efficient and consistent validation.
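To make the execution-based metrics concrete, below is a minimal sketch of a Soft EX-style check that treats two SELECT queries as equivalent when they return the same multiset of rows. This is an illustration only, not the benchmark's actual implementation: the helper name `soft_ex_match` is hypothetical, and SQLite stands in for the real dialect engines.

```python
# Illustrative Soft EX-style check: two SELECT queries "match" if they
# return the same rows, ignoring row order. Hypothetical helper; SQLite
# is used here only as a stand-in engine.
from collections import Counter
import sqlite3

def soft_ex_match(conn, predicted_sql: str, gold_sql: str) -> bool:
    """Return True if both queries yield the same multiset of rows."""
    pred_rows = conn.execute(predicted_sql).fetchall()
    gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(pred_rows) == Counter(gold_rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,)])
# Row order differs, but the multisets of rows are equal -> True
print(soft_ex_match(conn, "SELECT a FROM t", "SELECT a FROM t ORDER BY a DESC"))
```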
Quick Start & Requirements
Use `datasets.load_dataset` from HuggingFace for the flash (`birdsql/bird-critic-1.0-flash-exp`) or open (`birdsql/bird-critic-1.0-open`) version. Alternatively, use `pull_data.py` for the open version.
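A minimal loading sketch, assuming the `datasets` library is installed; split names may differ, so check the dataset cards on HuggingFace:

```python
# Minimal sketch: pull both BIRD-CRITIC variants from HuggingFace.
# Requires: pip install datasets
from datasets import load_dataset

flash = load_dataset("birdsql/bird-critic-1.0-flash-exp")  # PostgreSQL-focused flash set
full = load_dataset("birdsql/bird-critic-1.0-open")        # open multi-dialect set

print(flash)  # inspect available splits and task fields
```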
Running the evaluation requires the `requirements.txt` dependencies, LLM API keys (configured in `config.py`), and database dumps for PostgreSQL, MySQL, SQL Server, and Oracle.

Highlighted Details
A lightweight flash version (`bird-critic-1.0-flash-exp`) focused on PostgreSQL is available for quicker iteration.

Maintenance & Community
Contact bird.bench23@gmail.com or bird.bench25@gmail.com for full dataset access.

Licensing & Compatibility
Limitations & Caveats