BIRD-Interact by bird-bench

Interactive Text-to-SQL benchmark

Created 8 months ago
455 stars

Top 66.5% on SourcePulse

View on GitHub
Project Summary

BIRD-INTERACT is a benchmark that evaluates Text-to-SQL models through dynamic, multi-turn interactions simulating enterprise environments. Built on a hierarchical knowledge base and a function-driven user simulator, it offers two rigorous evaluation modes: 'c-Interact' for passive, fixed-workflow conversations and 'a-Interact' for active, model-led agentic interactions. By challenging models to resolve ambiguity and sustain communication over extended dialogues, it goes well beyond typical static benchmarks, making it valuable for researchers and developers pushing the boundaries of Text-to-SQL capabilities.

How It Works

BIRD-INTERACT evaluates Text-to-SQL models through dynamic, multi-turn interactions simulating enterprise environments. It employs a hierarchical knowledge base and a function-driven user simulator. Two modes are offered: 'c-Interact' for passive, fixed-workflow conversations and 'a-Interact' for active, model-led agentic interactions. This approach rigorously tests models' ability to handle ambiguity and sustain communication over extended dialogues, far exceeding typical static benchmarks.
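To make the agentic mode concrete, here is a minimal, illustrative sketch of an a-Interact-style loop: the model may either ask the user simulator a clarifying question or submit a final SQL answer. All function names, the action protocol, and the toy knowledge base are assumptions for illustration, not the benchmark's actual API.

```python
# Illustrative sketch of an a-Interact-style agentic loop.
# Function names and the action protocol are assumptions, not BIRD-INTERACT's API.

def user_simulator(clarification: str) -> str:
    """Stub user simulator: answers a clarification from a fixed knowledge base."""
    knowledge = {"which year": "2023", "which table": "orders"}
    for key, answer in knowledge.items():
        if key in clarification.lower():
            return answer
    return "No further information."

def agent_step(question: str, history: list) -> dict:
    """Toy agent policy: ask one clarification, then commit to a SQL answer."""
    if not history:
        return {"action": "ask", "content": "Which year should the report cover?"}
    year = history[-1]  # use the simulator's last answer to resolve the ambiguity
    return {"action": "submit",
            "content": f"SELECT * FROM orders WHERE year = {year};"}

def run_episode(question: str, max_turns: int = 4) -> str:
    """Alternate agent and simulated user turns until the agent submits SQL."""
    history = []
    for _ in range(max_turns):
        step = agent_step(question, history)
        if step["action"] == "submit":
            return step["content"]
        history.append(user_simulator(step["content"]))
    return ""

sql = run_episode("Show me the sales report.")
print(sql)  # → SELECT * FROM orders WHERE year = 2023;
```

In the real benchmark, the agent's turn budget and the simulator's responses are what distinguish a-Interact (model-led) from c-Interact (fixed workflow); this sketch only shows the shape of the model-led loop.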

Quick Start & Requirements

A lite version (bird-interact-lite-exp) with 270 PostgreSQL tasks is available for quick experimentation. The full 600-task version (bird-interact-full) is pending release. PostgreSQL is the primary database. Ground truth SQLs and test cases are not bundled with the dataset and must be requested via email to bird.bench25@gmail.com with the tag [bird-interact-lite GT&Test Cases]. Code for conversational (bird_interact_conv) and agentic (bird_interact_agent) modes is provided, with dependencies listed in requirements.txt and Docker support in the evaluation directory.
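Since the summary does not document the task schema, the snippet below is only a hypothetical sketch of what a single lite-version task record might look like and how one could summarize it for inspection. Every field name here is an assumption.

```python
# Hypothetical example record shaped like a bird-interact-lite task.
# The real schema is not documented in this summary; field names are assumptions.
sample_task = {
    "task_id": "lite_0001",
    "db": "postgresql",
    "ambiguous_question": "List the top customers.",
    "knowledge": ["'top' means highest total order value", "limit to 5 rows"],
}

def describe_task(task: dict) -> str:
    """Render a short human-readable summary of one task record."""
    hints = "; ".join(task["knowledge"])
    return f"[{task['task_id']} on {task['db']}] {task['ambiguous_question']} (hints: {hints})"

print(describe_task(sample_task))
```

Note that ground truth SQLs and test cases are distributed separately (via the email request above), so a loader like this would cover only the task side of the dataset.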

Highlighted Details

State-of-the-art models achieve low success rates (≈24% c-Interact, ≈18% a-Interact) on this challenging benchmark, highlighting its rigor. Performance tables for the lite version showcase leading models like o3-mini and GPT-4o. Notably, Claude-3.7-sonnet is the only model identified as satisfying the Interaction-Time Scaling (ITS) law, demonstrating improved performance through sustained multi-turn dialogue.

Maintenance & Community

Created by the BIRD Team & Google Cloud. A "Todo Lists" section indicates ongoing development, including planned SFT/RL training for the user simulator. No direct community links (e.g., Discord, Slack) are provided in the README snippet.

Licensing & Compatibility

Licensed under CC BY-SA 4.0 (Creative Commons Attribution-ShareAlike 4.0 International). The ShareAlike condition requires derivative works to be distributed under the same terms, which may complicate commercial use or integration into closed-source projects.

Limitations & Caveats

The full 600-task benchmark (bird-interact-full) is not yet released. Access to ground truth SQLs and test cases requires a separate email request. A recent bug fix in the Bird-Interact-Agent code suggests ongoing development and potential for undiscovered issues. The user simulator's SFT/RL training is still a planned feature, indicating potential for future improvements or current limitations in simulator realism.

Health Check
Last Commit

4 days ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
1 star in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems") and Andreas Jansson (Cofounder of Replicate).

natural-sql by cfahlgren1

0%
867
Text-to-SQL LLMs with strong performance
Created 2 years ago
Updated 1 year ago
Starred by Dan Abramov (Core Contributor to React; Coauthor of Redux and Create React App), Gabriel Almeida (Cofounder of Langflow), and 9 more.

terminal-bench by laude-institute

2.8%
1k
Benchmark for LLM agents in real terminal environments
Created 1 year ago
Updated 5 days ago