BIRD-CRITIC-1 by bird-bench

SQL benchmark for diagnosing/solving user issues in real-world databases

Created 6 months ago · 766 stars · Top 46.5% on sourcepulse

Project Summary

BIRD-CRITIC 1.0 is a comprehensive SQL benchmark that assesses the ability of large language models (LLMs) to diagnose and resolve user-reported issues in real-world database applications. It targets researchers and developers building LLM-powered database tools, offering a rigorous evaluation framework for SQL problem-solving across multiple dialects.

How It Works

The benchmark comprises 800 tasks (600 for development, 200 out-of-distribution) covering MySQL, PostgreSQL, SQL Server, and Oracle. It moves beyond simple SELECT queries to include CRUD operations and efficiency tuning, reflecting practical database challenges. Each task is human-verified for reproducibility and is scored with an evaluation metric matched to its type:

  • Soft EX (SELECT-only)
  • Soft EX + Parsing (user-defined refinements)
  • Test Case (logic correctness for CRUD/multi-query)
  • Query Execution Plan (efficiency/runtime error analysis)

An optimized execution-based evaluation environment using Docker and PostgreSQL templates ensures efficient and consistent validation.
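
For intuition, here is a minimal sketch of what a Soft EX-style check could look like for a SELECT task: comparing the predicted and reference result sets as multisets, so row order is ignored. This is an assumption about the metric's spirit, not the benchmark's actual evaluator; the soft_ex function name and row format are hypothetical.

    # Illustrative sketch only -- not the official BIRD-CRITIC evaluator.
    # Assumes each query result is a list of row tuples (e.g. cursor.fetchall()).
    from collections import Counter

    def soft_ex(predicted_rows, gold_rows):
        """Soft execution match: same multiset of rows, order-insensitive."""
        return Counter(map(tuple, predicted_rows)) == Counter(map(tuple, gold_rows))

    # Equivalent results in different row orders still match.
    assert soft_ex([(1, "a"), (2, "b")], [(2, "b"), (1, "a")])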

Quick Start & Requirements

  • Installation: Load the flash (birdsql/bird-critic-1.0-flash-exp) or open (birdsql/bird-critic-1.0-open) version with datasets.load_dataset from HuggingFace, or run pull_data.py for the open version (see the loading sketch after this list).
  • Prerequisites: Python 3.10 (via Conda), requirements.txt dependencies, LLM API keys (configured in config.py), and database dumps (PostgreSQL, MySQL, SQL Server, Oracle) for evaluation.
  • Setup: Requires setting up a Conda environment, installing dependencies, configuring LLM API keys, and downloading/unzipping database dumps. Evaluation uses Docker Compose.
  • Links: HuggingFace Datasets, Quick Eval Folder Structure
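
A minimal loading sketch, assuming the HuggingFace datasets library is installed (pip install datasets); the two dataset IDs are the ones listed above, and the split layout is whatever the hub returns:

    # Minimal loading sketch -- assumes `pip install datasets` and network access.
    from datasets import load_dataset

    flash = load_dataset("birdsql/bird-critic-1.0-flash-exp")  # PostgreSQL-only lite version
    open_ds = load_dataset("birdsql/bird-critic-1.0-open")     # multi-dialect open version

    # Inspect the splits and task counts the hub provides.
    print(flash)
    print(open_ds)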

Highlighted Details

  • Features 200 held-out out-of-distribution (OOD) tests for robust generalization evaluation.
  • Includes a "lite" version (bird-critic-1.0-flash-exp) focused on PostgreSQL for quicker iteration.
  • Provides baseline code for generating LLM outputs and an evaluation framework.
  • Supports multiple evaluation metrics tailored to different SQL task complexities.

Maintenance & Community

  • Created by the BIRD Team & Google Cloud.
  • Roadmap includes updating agent baselines and future benchmark versions (BIRD-CRITIC 1.5 / 2.0).
  • Contact: bird.bench23@gmail.com or bird.bench25@gmail.com for full dataset access.

Licensing & Compatibility

  • License: cc-by-sa-4.0 (Creative Commons Attribution-ShareAlike 4.0 International).
  • The license permits commercial use but requires derivative works, including modifications, to be shared under the same or a compatible license.

Limitations & Caveats

  • The full dataset with ground truth solutions and test cases is not directly included in the public HuggingFace release to prevent data leakage; email is required for access.
  • Some planned features, such as a full PostgreSQL-specific 600-instance version and updated agent baselines, are still pending.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 203 stars in the last 90 days
