Spider2 by xlang-ai

Benchmark dataset for text-to-SQL evaluation in enterprise settings

Created 1 year ago
565 stars

Top 56.8% on SourcePulse

Project Summary

Spider 2.0 is a benchmark dataset designed to evaluate Large Language Models (LLMs) on complex, real-world enterprise Text-to-SQL tasks. It targets researchers and developers working on LLM-powered data analysis and SQL generation, offering a significantly more challenging evaluation than previous benchmarks. The dataset aims to drive advancements in LLM code generation capabilities for intricate data environments and multiple SQL dialects.

How It Works

Spider 2.0 presents a more realistic challenge by incorporating large databases with over 3000 columns and supporting multiple SQL dialects like BigQuery and Snowflake. It includes diverse operations such as data transformation and analytics, moving beyond simple query generation. The benchmark is structured into versions like Spider 2.0-Snow and Spider 2.0-Lite, with an additional "Code agent" task type for evaluating agent-based LLM workflows. This approach aims to better reflect the complexities encountered in enterprise data scenarios.
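
To illustrate why multi-dialect support raises the difficulty, here is a minimal sketch (not taken from the Spider 2.0 repository) showing how a single logical operation, bucketing a date column by month, is spelled differently in three dialects. The names `MONTH_BUCKET_TEMPLATES` and `month_bucket` are hypothetical helpers introduced only for this example.

```python
# Illustrative only: one logical operation ("bucket a date column by month")
# rendered for three SQL dialects. These helpers are hypothetical and are
# not part of the Spider 2.0 codebase.
MONTH_BUCKET_TEMPLATES = {
    # BigQuery: the date part is a bare keyword and comes second.
    "bigquery": "DATE_TRUNC({col}, MONTH)",
    # Snowflake: the date part is a string literal and comes first.
    "snowflake": "DATE_TRUNC('MONTH', {col})",
    # SQLite: no DATE_TRUNC; strftime yields a 'YYYY-MM' text value instead.
    "sqlite": "strftime('%Y-%m', {col})",
}


def month_bucket(dialect: str, column: str) -> str:
    """Return a month-bucketing SQL expression for the given dialect."""
    try:
        template = MONTH_BUCKET_TEMPLATES[dialect.lower()]
    except KeyError:
        raise ValueError(f"unsupported dialect: {dialect}") from None
    return template.format(col=column)


if __name__ == "__main__":
    for name in MONTH_BUCKET_TEMPLATES:
        print(f"{name:10s} {month_bucket(name, 'order_date')}")
```

A model that has mostly seen one dialect can produce a query that is logically right but fails to run on another engine, which is one reason a multi-dialect benchmark is harder than a single-engine one.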

Quick Start & Requirements

  • To access the full datasets, users must sign up for BigQuery and Snowflake accounts and follow provided guidelines for credentials and access.
  • For benchmarking LLMs, the recommended approach is to use the Spider-Agent Framework with spider-agent-lite and spider-agent-snow.
  • Official validation and leaderboard submission require following specific submission guidance.
  • Links: Website, Paper, spider-agent-lite, spider-agent-snow

Highlighted Details

  • Significantly more challenging than Spider 1.0 and BIRD, with top LLMs like GPT-4 achieving only 6.0% accuracy on Spider 2.0 tasks.
  • Supports multiple SQL dialects including BigQuery, Snowflake, PostgreSQL, ClickHouse, and SQLite.
  • Includes versions for direct benchmarking (spider2-lite, spider2-snow) and agent-based evaluation (spider-agent-lite, spider-agent-snow).
  • Data updates and gold SQL releases are ongoing, with a dynamic leaderboard for official validation.
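
The accuracy figure above can be read as the fraction of tasks a model solves. As a rough sketch of how such a score might be computed, the example below runs predicted and gold queries against an in-memory SQLite database and counts a task as solved when both return the same rows. This is an assumption-laden illustration, not the official Spider 2.0 evaluation script: the function names, the toy `orders` table, and the order-insensitive row comparison are all choices made for this example.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Run a query and return its rows as a multiset, or None if it fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Counter ignores row order but still respects duplicate rows.
    return Counter(rows)


def execution_accuracy(conn, pairs):
    """Fraction of (predicted_sql, gold_sql) pairs whose results match."""
    if not pairs:
        return 0.0
    correct = 0
    for predicted_sql, gold_sql in pairs:
        gold = run_query(conn, gold_sql)
        predicted = run_query(conn, predicted_sql)
        if gold is not None and predicted == gold:
            correct += 1
    return correct / len(pairs)


def demo():
    """Score three toy predictions against a hypothetical orders table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (id INTEGER, region TEXT, amount REAL);
        INSERT INTO orders VALUES (1, 'EU', 10.0), (2, 'US', 20.0), (3, 'EU', 5.0);
        """
    )
    pairs = [
        # Same rows, different ordering: counted as a match.
        ("SELECT region, SUM(amount) FROM orders GROUP BY region",
         "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"),
        # Wrong filter: results differ.
        ("SELECT COUNT(*) FROM orders WHERE region = 'US'",
         "SELECT COUNT(*) FROM orders WHERE region = 'EU'"),
        # Invalid SQL: counted as a failure.
        ("SELEC broken", "SELECT 1"),
    ]
    return execution_accuracy(conn, pairs)


if __name__ == "__main__":
    print(f"execution accuracy: {demo():.3f}")
```

Comparing result sets rather than SQL strings lets differently written but equivalent queries count as correct. For official numbers, follow the project's own submission guidance.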

Maintenance & Community

The project is associated with the ICLR 2025 conference and acknowledges contributions from various researchers. Snowflake provided significant support for hosting the challenge. Further community engagement details (e.g., Discord/Slack) are not explicitly mentioned in the README.

Licensing & Compatibility

The README does not explicitly state a license for the dataset or its associated tools. The project's focus on academic evaluation suggests it is intended for research and development use, but the exact terms should be confirmed in the repository.

Limitations & Caveats

The README advises against using the released Gold SQL for fine-tuning LLMs, as it may compromise evaluation fairness. Accessing the full dataset requires setting up cloud database accounts (BigQuery, Snowflake), which may incur costs and setup time. The project is actively updated, and leaderboard results may change dynamically.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull requests (30d): 3
  • Issues (30d): 16
  • Star history: 36 stars in the last 30 days
