Benchmark dataset for text-to-SQL evaluation in enterprise settings
Spider 2.0 is a benchmark dataset designed to evaluate Large Language Models (LLMs) on complex, real-world enterprise Text-to-SQL tasks. It targets researchers and developers working on LLM-powered data analysis and SQL generation, offering a significantly more challenging evaluation than previous benchmarks. The dataset aims to drive advancements in LLM code generation capabilities for intricate data environments and multiple SQL dialects.
How It Works
Spider 2.0 presents a more realistic challenge by incorporating large databases with over 3000 columns and supporting multiple SQL dialects like BigQuery and Snowflake. It includes diverse operations such as data transformation and analytics, moving beyond simple query generation. The benchmark is structured into versions like Spider 2.0-Snow and Spider 2.0-Lite, with an additional "Code agent" task type for evaluating agent-based LLM workflows. This approach aims to better reflect the complexities encountered in enterprise data scenarios.
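Text-to-SQL benchmarks like this are typically scored by execution accuracy: the predicted SQL and the gold SQL are both run against the database, and the result sets are compared. Below is a minimal, illustrative sketch of that idea using an in-memory SQLite database — not the official Spider 2.0 evaluator, which runs against real BigQuery/Snowflake warehouses:

```python
import sqlite3

def execution_match(db_conn, pred_sql, gold_sql):
    """Order-insensitive comparison of predicted vs. gold result sets."""
    try:
        pred = db_conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False  # predicted query failed to execute at all
    gold = db_conn.execute(gold_sql).fetchall()
    return sorted(map(tuple, pred)) == sorted(map(tuple, gold))

# Toy table standing in for an enterprise warehouse schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'EU', 10.0), (2, 'US', 25.0), (3, 'EU', 5.0);
""")

gold = "SELECT region, SUM(amount) FROM orders GROUP BY region"
pred = "SELECT region, SUM(amount) AS total FROM orders GROUP BY region"
print(execution_match(conn, pred, gold))  # True: same rows despite the alias
```

Execution-based scoring is what makes the anti-fine-tuning caveat below matter: a model trained on the gold SQL can reproduce result sets without genuinely solving the task.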
Quick Start & Requirements
Quick-start instructions are provided for the spider-agent-lite and spider-agent-snow setups; each requires configuring database credentials before evaluation can run.
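A typical setup sequence might look like the following. The repository URL and environment steps are assumptions based on the Spider 2.0 project layout — consult the official README for the exact commands:

```shell
# Clone the benchmark repository (URL assumed; verify against the README)
git clone https://github.com/xlang-ai/Spider2.git
cd Spider2

# Create an isolated Python environment for the agent tasks
python -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt  # per-task folders may ship their own requirements

# BigQuery/Snowflake credentials must then be configured per the task instructions
```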
Highlighted Details
The benchmark supports both SQL-only evaluation (spider2-lite, spider2-snow) and agent-based evaluation (spider-agent-lite, spider-agent-snow).
Maintenance & Community
The project is associated with the ICLR 2025 conference and acknowledges contributions from various researchers. Snowflake provided significant support for hosting the challenge. Further community engagement details (e.g., Discord/Slack) are not explicitly mentioned in the README.
Licensing & Compatibility
The README does not explicitly state a license for the dataset and associated tools. The benchmark's focus on academic evaluation suggests it is intended for research and development use, but users should confirm the licensing terms in the repository before redistribution or commercial use.
Limitations & Caveats
The README advises against using the released Gold SQL to fine-tune LLMs, as doing so would compromise evaluation fairness. Accessing the full dataset requires setting up cloud database accounts (BigQuery, Snowflake), which may involve setup time and usage costs. The project is actively updated, so leaderboard results may change over time.