SQL evaluation tool for LLM-generated queries
This repository provides a framework for evaluating the accuracy of Large Language Model (LLM) generated SQL queries against a database schema. It's designed for researchers and developers working on text-to-SQL systems, offering a robust method to benchmark LLM performance using a curated dataset derived from the Spider benchmark, enhanced with new questions and query categories.
How It Works
The evaluation process involves generating SQL queries (typically from an LLM), executing both the generated and "gold" queries against a database to retrieve results, and then comparing those results using "exact" and "subset" matching criteria. This yields a quantitative assessment of query correctness, while additional metrics such as token usage and latency are logged for comprehensive reporting.
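The matching step can be sketched in a few lines. The following is a minimal illustration, assuming results have been loaded into pandas DataFrames; the result_rows and compare_results helpers are hypothetical and simplify away details such as column ordering and duplicate rows:

```python
import pandas as pd


def result_rows(df: pd.DataFrame) -> set:
    """Represent a query result as an order-insensitive set of row tuples."""
    return set(map(tuple, df.astype(str).values.tolist()))


def compare_results(gold: pd.DataFrame, generated: pd.DataFrame) -> dict:
    """Hypothetical comparison helper (not the repository's implementation).

    exact_match:  the generated result contains exactly the gold rows.
    subset_match: every gold row appears in the generated result,
                  tolerating extra rows returned by the generated query.
    """
    gold_rows = result_rows(gold)
    gen_rows = result_rows(generated)
    return {
        "exact_match": gold_rows == gen_rows,
        "subset_match": gold_rows.issubset(gen_rows),
    }


# Example: the generated query returns one extra city beyond the gold result.
gold_df = pd.DataFrame({"city": ["Paris", "Tokyo"]})
generated_df = pd.DataFrame({"city": ["Paris", "Tokyo", "Delhi"]})
print(compare_results(gold_df, generated_df))
# -> {'exact_match': False, 'subset_match': True}
```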
Quick Start & Requirements
Clone the defog-data repository, install the Python dependencies (pip install -r requirements.txt, pip install -e .), and optionally download a spaCy model. Databases are set up from the defog-data repository, with setup scripts provided.
Highlighted Details
Maintenance & Community
Contributions are welcomed for dataset expansion, framework code improvements, and new generator/runner implementations. Further details are available in CONTRIBUTING.md.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification of the licensing terms.
Limitations & Caveats
The README notes that the Llama CPP and MLX runners currently lack beam search, which may reduce result quality. It also notes that populating the databases with meaningful data is crucial to avoid false positives during evaluation.