Benchmark dataset for text-to-SQL evaluation in enterprise settings
Spider 2.0 is a benchmark dataset designed to evaluate Large Language Models (LLMs) on complex, real-world enterprise Text-to-SQL tasks. It targets researchers and developers working on LLM-powered data analysis and SQL generation, offering a significantly more challenging evaluation than previous benchmarks. The dataset aims to drive advancements in LLM code generation capabilities for intricate data environments and multiple SQL dialects.
How It Works
Spider 2.0 presents a more realistic challenge by incorporating large databases with over 3000 columns and supporting multiple SQL dialects like BigQuery and Snowflake. It includes diverse operations such as data transformation and analytics, moving beyond simple query generation. The benchmark is structured into versions like Spider 2.0-Snow and Spider 2.0-Lite, with an additional "Code agent" task type for evaluating agent-based LLM workflows. This approach aims to better reflect the complexities encountered in enterprise data scenarios.
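Text-to-SQL benchmarks like this are typically scored by execution accuracy: the predicted SQL and the gold SQL are both run against the database, and the result sets are compared. Below is a minimal, illustrative sketch of that idea using an in-memory SQLite database — not the official Spider 2.0 evaluator, which runs against real BigQuery/Snowflake warehouses:

```python
import sqlite3

def execution_match(db_conn, pred_sql, gold_sql):
    """Order-insensitive comparison of predicted vs. gold result sets."""
    try:
        pred = db_conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False  # predicted query failed to execute at all
    gold = db_conn.execute(gold_sql).fetchall()
    return sorted(map(tuple, pred)) == sorted(map(tuple, gold))

# Toy table standing in for an enterprise warehouse schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'EU', 10.0), (2, 'US', 25.0), (3, 'EU', 5.0);
""")

gold = "SELECT region, SUM(amount) FROM orders GROUP BY region"
pred = "SELECT region, SUM(amount) AS total FROM orders GROUP BY region"
print(execution_match(conn, pred, gold))  # True: same rows despite the alias
```

Execution-based scoring is what makes the anti-fine-tuning caveat below matter: a model trained on the gold SQL can reproduce result sets without genuinely solving the task.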
Quick Start & Requirements
Quick-start instructions are provided for the spider-agent-lite and spider-agent-snow setups; each requires configuring database credentials before evaluation can run.
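A typical setup sequence might look like the following. The repository URL and environment steps are assumptions based on the Spider 2.0 project layout — consult the official README for the exact commands:

```shell
# Clone the benchmark repository (URL assumed; verify against the README)
git clone https://github.com/xlang-ai/Spider2.git
cd Spider2

# Create an isolated Python environment for the agent tasks
python -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt  # per-task folders may ship their own requirements

# BigQuery/Snowflake credentials must then be configured per the task instructions
```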
Highlighted Details
The benchmark supports both SQL-only evaluation (spider2-lite, spider2-snow) and agent-based evaluation (spider-agent-lite, spider-agent-snow).
Maintenance & Community
The project is associated with the ICLR 2025 conference and acknowledges contributions from various researchers. Snowflake provided significant support for hosting the challenge. Further community engagement details (e.g., Discord/Slack) are not explicitly mentioned in the README.
Licensing & Compatibility
The README does not explicitly state a license for the dataset and associated tools. The benchmark's focus on academic evaluation suggests it is intended for research and development use, but users should confirm the licensing terms in the repository before redistribution or commercial use.
Limitations & Caveats
The README advises against using the released Gold SQL to fine-tune LLMs, as doing so would compromise evaluation fairness. Accessing the full dataset requires setting up cloud database accounts (BigQuery, Snowflake), which may involve setup time and usage costs. The project is actively updated, so leaderboard results may change over time.