sql-eval by defog-ai

SQL evaluation tool for LLM-generated queries

created 2 years ago · 690 stars · Top 50.2% on sourcepulse

Project Summary

This repository provides a framework for evaluating the accuracy of SQL queries generated by Large Language Models (LLMs) against a database. It is aimed at researchers and developers building text-to-SQL systems, and benchmarks LLM performance on a curated dataset derived from the Spider benchmark and extended with new questions and query categories.

How It Works

The evaluation process generates SQL queries (typically via an LLM), executes both the generated and "gold" queries against a database, and compares the two result sets using "exact" and "subset" matching criteria. This yields a quantitative measure of query correctness; additional metrics such as token usage and latency are logged for comprehensive reporting.
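
Concretely, the comparison stage can be pictured as follows. This is a minimal sketch using pandas, not the repository's actual implementation; the normalization rules (lower-casing column names, stringifying values, sorting rows and columns) are assumptions made for illustration.

```python
import pandas as pd

def normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Canonicalize a result set: lower-case column names, stringify values,
    then order columns by name and rows by value."""
    out = df.rename(columns=str.lower).astype(str).sort_index(axis=1)
    return out.sort_values(by=list(out.columns)).reset_index(drop=True)

def exact_match(gold: pd.DataFrame, generated: pd.DataFrame) -> bool:
    """True when both queries returned the same rows, ignoring row order
    and column order/case."""
    g, p = normalize(gold), normalize(generated)
    return g.shape == p.shape and g.values.tolist() == p.values.tolist()

def subset_match(gold: pd.DataFrame, generated: pd.DataFrame) -> bool:
    """True when every gold column occurs, as a multiset of values, among
    the generated columns: the generated query fetched at least the
    requested data, possibly alongside extra columns."""
    gen_cols = [sorted(generated[c].astype(str)) for c in generated.columns]
    return all(sorted(gold[c].astype(str)) in gen_cols for c in gold.columns)
```

Under a scheme like this, a generated query that returns extra columns can still pass the subset check while failing the exact check, which is why the two criteria are reported separately.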

Quick Start & Requirements

  • Installation: Clone the defog-data repository, install the Python dependencies (pip install -r requirements.txt, pip install -e .), and optionally download a spaCy model (a consolidated shell sketch follows this list).
  • Database Setup: A PostgreSQL instance is required; running it via Docker is recommended. Instructions cover creating, starting, and persisting the database.
  • Data Import: Data for multiple database types (Postgres, Snowflake, BigQuery, MySQL, SQLite, SQL Server) lives in the defog-data repository, with setup scripts provided.
  • Prerequisites: Basic command-line familiarity, Docker, a SQL client, and Python data-manipulation libraries.
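
Taken together, a first-time setup might look like the shell sketch below. The repository URLs, container and volume names, credentials, image tag, and the final import step are assumptions to adapt; the README and the defog-data instructions are authoritative.

```bash
# Fetch the framework and its companion dataset repository
git clone https://github.com/defog-ai/sql-eval.git
git clone https://github.com/defog-ai/defog-data.git

# Install Python dependencies for both
(cd sql-eval && pip install -r requirements.txt)
(cd defog-data && pip install -r requirements.txt && pip install -e .)

# Optional: spaCy model used by parts of the tooling
python -m spacy download en_core_web_sm

# Run PostgreSQL in Docker, with a named volume so the data persists
# (container/volume names, password, and image tag are placeholders)
docker create --name sql-eval-postgres \
  -e POSTGRES_PASSWORD=postgres \
  -p 5432:5432 \
  -v sql-eval-pgdata:/var/lib/postgresql/data \
  postgres:16
docker start sql-eval-postgres

# Finally, import the evaluation databases using the setup scripts
# shipped with defog-data (see that repository's instructions).
```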

Highlighted Details

  • Supports evaluation across multiple database backends including PostgreSQL, Snowflake, BigQuery, MySQL, SQLite, and SQL Server.
  • Integrates with various LLM inference methods: OpenAI, Anthropic, Hugging Face (including PEFT adapters and vLLM), AWS Bedrock, Together.ai, Llama CPP, MLX, Gemini, Mistral, and Deepseek.
  • Offers flexible prompting and execution configurations, including parallel processing, beam search, and chain-of-thought execution (an illustrative run command follows this list).
  • Includes optional cloud function deployment for uploading results to BigQuery or PostgreSQL.
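
As a sketch of how a run ties these options together: an evaluation is launched from the command line with a chosen database backend, runner, and model. The flag names below are assumptions for illustration, not the project's documented CLI; the README defines the real interface.

```bash
# Illustrative only -- flag names are assumptions, not the documented CLI:
#   -db  database backend the queries execute against
#   -g   generator/runner to use (e.g. an OpenAI-backed runner)
#   -m   model name handed to that runner
#   -q   question file containing the gold queries
#   -o   output CSV for per-question results, token usage, and latency
#   -p   number of questions processed in parallel
python main.py -db postgres -g oai -m gpt-4o \
  -q data/questions.csv -o results/eval.csv -p 5
```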

Maintenance & Community

Contributions are welcome for dataset expansion, framework code improvements, and new generator/runner implementations; see CONTRIBUTING.md for details.

Licensing & Compatibility

The README does not explicitly state a license. Suitability for commercial use or closed-source linking would therefore require clarifying the licensing terms with the maintainers.

Limitations & Caveats

The README notes that the Llama CPP and MLX runners currently lack beam search, which may reduce result quality. It also stresses that the database must be populated with meaningful data to avoid false positives: against an empty or sparsely populated table, a generated query with the wrong filter can return the same (empty) result set as the gold query and be scored as correct.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 36 stars in the last 90 days
