bird-bench
Interactive Text-to-SQL benchmark
BIRD-INTERACT evaluates Text-to-SQL models through dynamic, multi-turn interactions that simulate enterprise environments. The benchmark uses a hierarchical knowledge base and a function-driven user simulator to recreate realistic scenarios, and offers two modes: 'c-Interact' for passive, fixed-workflow conversations and 'a-Interact' for active, model-led agentic interactions. Unlike typical static benchmarks, it requires models to resolve ambiguity and sustain communication over extended dialogues, which makes it useful for researchers and developers working on Text-to-SQL capabilities.
How It Works
BIRD-INTERACT evaluates Text-to-SQL models through dynamic, multi-turn interactions simulating enterprise environments. It employs a hierarchical knowledge base and a function-driven user simulator. Two modes are offered: 'c-Interact' for passive, fixed-workflow conversations and 'a-Interact' for active, model-led agentic interactions. This approach rigorously tests models' ability to handle ambiguity and sustain communication over extended dialogues, far exceeding typical static benchmarks.
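The difference between the two modes can be illustrated with a toy sketch. Everything below is hypothetical: the class and function names, the "clarify, then submit once" workflow assumed for c-Interact, and the "ask" / "submit" actions assumed for a-Interact are illustrative stand-ins, not the benchmark's actual API. The sketch only shows who controls the dialogue in each mode.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# NOTE: all names here are hypothetical; this is not the BIRD-INTERACT code.


@dataclass
class ToyUserSimulator:
    """Stand-in for a function-driven user simulator: answers from a fixed table."""
    answers: Dict[str, str]
    log: List[str] = field(default_factory=list)

    def ask(self, question: str) -> str:
        self.log.append(question)  # record every question the model asked
        return self.answers.get(question, "I don't know.")


def c_interact(
    model_clarify: Callable[[List[str]], Optional[str]],
    model_sql: Callable[[List[str]], str],
    user: ToyUserSimulator,
    task: str,
    max_clarifications: int = 2,
) -> str:
    """Passive mode (assumed shape): the harness fixes the workflow.

    The model may ask up to `max_clarifications` questions, then must submit SQL.
    """
    context = [task]
    for _ in range(max_clarifications):
        question = model_clarify(context)
        if question is None:  # model has nothing more to ask
            break
        context.append(user.ask(question))
    return model_sql(context)


def a_interact(
    model_act: Callable[[List[str]], Tuple[str, str]],
    user: ToyUserSimulator,
    task: str,
    max_steps: int = 5,
) -> Optional[str]:
    """Active mode (assumed shape): the model picks each action itself.

    `model_act` returns ("ask", question) or ("submit", sql).
    Returns None if the step budget runs out before a submission.
    """
    context = [task]
    for _ in range(max_steps):
        action, payload = model_act(context)
        if action == "submit":
            return payload
        context.append(user.ask(payload))
    return None
```

In the passive loop the harness decides when the model may ask and when it must answer; in the active loop the model spends a step budget however it chooses, so it can also fail by never submitting.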
Quick Start & Requirements
A lite version (bird-interact-lite-exp) with 270 PostgreSQL tasks is available for quick experimentation. The full 600-task version (bird-interact-full) is pending release. PostgreSQL is the primary database. Ground truth SQLs and test cases are not bundled with the dataset and must be requested via email to bird.bench25@gmail.com with the tag [bird-interact-lite GT&Test Cases]. Code for conversational (bird_interact_conv) and agentic (bird_interact_agent) modes is provided, with dependencies listed in requirements.txt and Docker support in the evaluation directory.
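Because the ground truth arrives separately, it presumably has to be combined with the task records before evaluation. A minimal sketch of that step follows, assuming both files are JSON Lines and share an identifier field. The key `instance_id` and the field names `sol_sql` and `test_cases` used in the usage note are hypothetical placeholders, so check the files you actually receive.

```python
import json
from typing import Dict, List, Tuple


def read_jsonl(path: str) -> List[Dict]:
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def merge_ground_truth(
    tasks: List[Dict],
    ground_truth: List[Dict],
    key: str = "instance_id",  # hypothetical shared identifier field
) -> Tuple[List[Dict], List[str]]:
    """Attach ground-truth fields to each task record.

    Returns (merged_records, ids_of_tasks_without_ground_truth).
    On a field-name collision the ground-truth value wins.
    """
    gt_by_id = {row[key]: row for row in ground_truth}
    merged, missing = [], []
    for task in tasks:
        gt = gt_by_id.get(task[key])
        if gt is None:
            missing.append(task[key])
            continue
        merged.append({**task, **gt})
    return merged, missing
```

For example, merging a task `{"instance_id": "t1", "query": "..."}` with a ground-truth row `{"instance_id": "t1", "sol_sql": "...", "test_cases": [...]}` yields one record carrying all four fields. Any task without a matching row is reported in the second return value rather than silently dropped.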
Highlighted Details
State-of-the-art models achieve low success rates (≈24% c-Interact, ≈18% a-Interact) on this challenging benchmark, highlighting its rigor. Performance tables for the lite version showcase leading models like o3-mini and GPT-4o. Notably, Claude-3.7-sonnet is the only model identified as satisfying the Interaction-Time Scaling (ITS) law, demonstrating improved performance through sustained multi-turn dialogue.
Maintenance & Community
Created by the BIRD Team & Google Cloud. A "Todo Lists" section indicates ongoing development, including planned SFT/RL training for the user simulator. No direct community links (e.g., Discord, Slack) are provided in the README snippet.
Licensing & Compatibility
Licensed under cc-by-sa-4.0 (Creative Commons Attribution-ShareAlike 4.0 International). This copyleft license permits commercial use but requires attribution and requires adaptations to be shared under the same terms, which can complicate integration into closed-source projects.
Limitations & Caveats
The full 600-task benchmark (bird-interact-full) is not yet released. Access to ground truth SQLs and test cases requires a separate email request. A recent bug fix in the Bird-Interact-Agent code suggests ongoing development and potential for undiscovered issues. The user simulator's SFT/RL training is still a planned feature, indicating potential for future improvements or current limitations in simulator realism.