SQL evaluation tool for LLM-generated queries
This repository provides a framework for evaluating the accuracy of Large Language Model (LLM) generated SQL queries against a database schema. It's designed for researchers and developers working on text-to-SQL systems, offering a robust method to benchmark LLM performance using a curated dataset derived from the Spider benchmark, enhanced with new questions and query categories.
How It Works
The evaluation process involves generating SQL queries (typically from an LLM), executing both the generated and "gold" queries against a database to retrieve results, and then comparing those results using "exact" and "subset" matching criteria. This yields a quantitative assessment of query correctness, while additional metrics such as token usage and latency are logged for comprehensive reporting.
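The matching step can be sketched in a few lines. The following is a minimal illustration, assuming results have been loaded into pandas DataFrames; the result_rows and compare_results helpers are hypothetical and simplify away details such as column ordering and duplicate rows:

```python
import pandas as pd


def result_rows(df: pd.DataFrame) -> set:
    """Represent a query result as an order-insensitive set of row tuples."""
    return set(map(tuple, df.astype(str).values.tolist()))


def compare_results(gold: pd.DataFrame, generated: pd.DataFrame) -> dict:
    """Hypothetical comparison helper (not the repository's implementation).

    exact_match:  the generated result contains exactly the gold rows.
    subset_match: every gold row appears in the generated result,
                  tolerating extra rows returned by the generated query.
    """
    gold_rows = result_rows(gold)
    gen_rows = result_rows(generated)
    return {
        "exact_match": gold_rows == gen_rows,
        "subset_match": gold_rows.issubset(gen_rows),
    }


# Example: the generated query returns one extra city beyond the gold result.
gold_df = pd.DataFrame({"city": ["Paris", "Tokyo"]})
generated_df = pd.DataFrame({"city": ["Paris", "Tokyo", "Delhi"]})
print(compare_results(gold_df, generated_df))
# -> {'exact_match': False, 'subset_match': True}
```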
Quick Start & Requirements
Clone the defog-data repository, install the Python dependencies (pip install -r requirements.txt, pip install -e .), and optionally download a spaCy model. Databases are set up from the defog-data repository, with setup scripts provided.
Highlighted Details
Maintenance & Community
Contributions are welcomed for dataset expansion, framework code improvements, and new generator/runner implementations. Further details are available in CONTRIBUTING.md.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification of the licensing terms.
Limitations & Caveats
The README notes that the Llama CPP and MLX runners currently lack beam search, which may reduce result quality. It also notes that populating the databases with meaningful data is crucial to avoid false positives during evaluation.