BIRD-Interact by bird-bench

Interactive Text-to-SQL benchmark

Created 8 months ago
455 stars

Top 66.5% on SourcePulse

View on GitHub
Project Summary

BIRD-INTERACT is a benchmark that evaluates Text-to-SQL models through dynamic, multi-turn interactions simulating enterprise environments. Built on a hierarchical knowledge base and a function-driven user simulator, it offers two rigorous evaluation modes: 'c-Interact' for passive, fixed-workflow conversations and 'a-Interact' for active, model-led agentic interactions. By challenging models to resolve ambiguity and sustain communication over extended dialogues, it goes well beyond typical static benchmarks, making it valuable for researchers and developers pushing the boundaries of Text-to-SQL capabilities.

How It Works

BIRD-INTERACT evaluates Text-to-SQL models through dynamic, multi-turn interactions simulating enterprise environments. It employs a hierarchical knowledge base and a function-driven user simulator. Two modes are offered: 'c-Interact' for passive, fixed-workflow conversations and 'a-Interact' for active, model-led agentic interactions. This approach rigorously tests models' ability to handle ambiguity and sustain communication over extended dialogues, far exceeding typical static benchmarks.
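To make the agentic mode concrete, here is a minimal, illustrative sketch of an a-Interact-style loop: the model may either ask the user simulator a clarifying question or submit a final SQL answer. All function names, the action protocol, and the toy knowledge base are assumptions for illustration, not the benchmark's actual API.

```python
# Illustrative sketch of an a-Interact-style agentic loop.
# Function names and the action protocol are assumptions, not BIRD-INTERACT's API.

def user_simulator(clarification: str) -> str:
    """Stub user simulator: answers a clarification from a fixed knowledge base."""
    knowledge = {"which year": "2023", "which table": "orders"}
    for key, answer in knowledge.items():
        if key in clarification.lower():
            return answer
    return "No further information."

def agent_step(question: str, history: list) -> dict:
    """Toy agent policy: ask one clarification, then commit to a SQL answer."""
    if not history:
        return {"action": "ask", "content": "Which year should the report cover?"}
    year = history[-1]  # use the simulator's last answer to resolve the ambiguity
    return {"action": "submit",
            "content": f"SELECT * FROM orders WHERE year = {year};"}

def run_episode(question: str, max_turns: int = 4) -> str:
    """Alternate agent and simulated user turns until the agent submits SQL."""
    history = []
    for _ in range(max_turns):
        step = agent_step(question, history)
        if step["action"] == "submit":
            return step["content"]
        history.append(user_simulator(step["content"]))
    return ""

sql = run_episode("Show me the sales report.")
print(sql)  # → SELECT * FROM orders WHERE year = 2023;
```

In the real benchmark, the agent's turn budget and the simulator's responses are what distinguish a-Interact (model-led) from c-Interact (fixed workflow); this sketch only shows the shape of the model-led loop.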

Quick Start & Requirements

A lite version (bird-interact-lite-exp) with 270 PostgreSQL tasks is available for quick experimentation. The full 600-task version (bird-interact-full) is pending release. PostgreSQL is the primary database. Ground truth SQLs and test cases are not bundled with the dataset and must be requested via email to bird.bench25@gmail.com with the tag [bird-interact-lite GT&Test Cases]. Code for conversational (bird_interact_conv) and agentic (bird_interact_agent) modes is provided, with dependencies listed in requirements.txt and Docker support in the evaluation directory.
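Since the summary does not document the task schema, the snippet below is only a hypothetical sketch of what a single lite-version task record might look like and how one could summarize it for inspection. Every field name here is an assumption.

```python
# Hypothetical example record shaped like a bird-interact-lite task.
# The real schema is not documented in this summary; field names are assumptions.
sample_task = {
    "task_id": "lite_0001",
    "db": "postgresql",
    "ambiguous_question": "List the top customers.",
    "knowledge": ["'top' means highest total order value", "limit to 5 rows"],
}

def describe_task(task: dict) -> str:
    """Render a short human-readable summary of one task record."""
    hints = "; ".join(task["knowledge"])
    return f"[{task['task_id']} on {task['db']}] {task['ambiguous_question']} (hints: {hints})"

print(describe_task(sample_task))
```

Note that ground truth SQLs and test cases are distributed separately (via the email request above), so a loader like this would cover only the task side of the dataset.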

Highlighted Details

State-of-the-art models achieve low success rates (≈24% c-Interact, ≈18% a-Interact) on this challenging benchmark, highlighting its rigor. Performance tables for the lite version showcase leading models like o3-mini and GPT-4o. Notably, Claude-3.7-sonnet is the only model identified as satisfying the Interaction-Time Scaling (ITS) law, demonstrating improved performance through sustained multi-turn dialogue.

Maintenance & Community

Created by the BIRD Team & Google Cloud. A "Todo Lists" section indicates ongoing development, including planned SFT/RL training for the user simulator. No direct community links (e.g., Discord, Slack) are provided in the README snippet.

Licensing & Compatibility

Licensed under CC BY-SA 4.0 (Creative Commons Attribution-ShareAlike 4.0 International). The ShareAlike condition requires derivative works to be distributed under the same terms, which may complicate commercial use or integration into closed-source projects.

Limitations & Caveats

The full 600-task benchmark (bird-interact-full) is not yet released. Access to ground truth SQLs and test cases requires a separate email request. A recent bug fix in the Bird-Interact-Agent code suggests ongoing development and potential for undiscovered issues. The user simulator's SFT/RL training is still a planned feature, indicating potential for future improvements or current limitations in simulator realism.

Health Check
Last Commit

4 days ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
1 star in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems") and Andreas Jansson (Cofounder of Replicate).

natural-sql by cfahlgren1

0%
867
Text-to-SQL LLMs with strong performance
Created 2 years ago
Updated 1 year ago
Starred by Dan Abramov (Core Contributor to React; Coauthor of Redux and Create React App), Gabriel Almeida (Cofounder of Langflow), and 9 more.

terminal-bench by laude-institute

2.8%
1k
Benchmark for LLM agents in real terminal environments
Created 1 year ago
Updated 5 days ago