DAIL-SQL by BeachWang

Few-shot NL2SQL method for GPT-4

created 1 year ago
571 stars

Top 57.3% on sourcepulse

Project Summary

DAIL-SQL is a few-shot Text-to-SQL method designed to optimize Large Language Model (LLM) performance on complex database querying tasks. It targets researchers and practitioners in natural language processing and database management, achieving leaderboard-level execution accuracy on benchmarks like Spider.

How It Works

DAIL-SQL enhances Text-to-SQL by encoding structural knowledge as SQL statements, selecting relevant examples based on both question and query similarity, and optimizing token efficiency by excluding cross-domain knowledge. This approach leverages LLMs' in-context learning capabilities by carefully curating prompt content, leading to improved accuracy and reduced computational cost.
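As a rough illustration of the example-selection idea, the sketch below ranks candidate few-shot examples by question similarity while filtering on similarity between a pre-generated draft SQL query and each candidate's gold query. This is not the authors' code: token-level Jaccard overlap stands in for the embedding and query-match metrics used in the paper, and `select_examples` and its parameters are hypothetical names.

```python
# Hedged sketch of DAIL-SQL-style example selection (not the authors' code).
# Candidates are ranked by question similarity and filtered by the similarity
# between a draft SQL query and each candidate's gold query. Jaccard overlap
# stands in for the embedding/match metrics used in the paper.

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def select_examples(question, draft_sql, pool, k=3, query_threshold=0.2):
    """Pick k few-shot examples whose questions resemble the target
    question and whose SQL resembles the draft query."""
    scored = []
    for ex in pool:  # each ex is {"question": ..., "sql": ...}
        q_sim = jaccard(question, ex["question"])
        s_sim = jaccard(draft_sql, ex["sql"])
        if s_sim >= query_threshold:      # query-similarity filter
            scored.append((q_sim, ex))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [ex for _, ex in scored[:k]]
```

In the actual method the draft query comes from a preliminary model pass, which is why some selection strategies depend on pre-generated queries (see the `pre_test_result` caveat below).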

Quick Start & Requirements

  • Installation: Requires Python 3.8+ and conda. Install dependencies via pip install -r requirements.txt after setting up the environment.
  • Prerequisites:
    • Stanford CoreNLP server: Download and run java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer & from ./third_party/stanford-corenlp-full-2018-10-05.
    • NLTK data: Run python nltk_downloader.py.
    • Spider dataset: Download to ./dataset/spider.
    • OpenAI API key for GPT-4 or GPT-3.5-turbo.
  • Setup Time: Moderate, involving dataset download, CoreNLP setup, and environment configuration.
  • Links: Stanford CoreNLP, Spider Benchmark
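The steps above can be consolidated into one setup recipe. This is a sketch based on the bullets, not an official script: the conda environment name `dail-sql` and the placement of commands are assumptions.

```shell
# Setup sketch assembled from the steps above; the env name "dail-sql"
# is an assumption, not from the README.
conda create -n dail-sql python=3.8 -y
conda activate dail-sql
pip install -r requirements.txt

# Start the Stanford CoreNLP server in the background.
cd ./third_party/stanford-corenlp-full-2018-10-05
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer &
cd -

# Fetch NLTK data; the Spider dataset must be placed under ./dataset/spider.
python nltk_downloader.py

export OPENAI_API_KEY="sk-..."   # required for GPT-4 / GPT-3.5-turbo
```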

Highlighted Details

  • Achieved 86.6% execution accuracy on the Spider leaderboard using GPT-4 with self-consistency voting.
  • Token-efficient: requires only ~1600 tokens per question on Spider-dev.
  • Empirically evaluates various prompt engineering strategies, including question representations, example selection, and organization.
  • Selects examples considering both question similarity and query similarity for optimal few-shot learning.

Maintenance & Community

  • The project is associated with authors from institutions like Shanghai Jiao Tong University and Microsoft.
  • Code for schema-linking is inspired by RAT-SQL, and self-consistency voting by C3SQL.
  • No explicit community channels (Discord/Slack) or roadmap are mentioned in the README.

Licensing & Compatibility

  • The README does not explicitly state a license. The project is presented as open-source, but users should verify licensing for commercial or closed-source use.

Limitations & Caveats

  • Requires a running Stanford CoreNLP server, adding an external dependency.
  • Relies on OpenAI's proprietary models (GPT-4, GPT-3.5-turbo), necessitating API access and associated costs.
  • The "pre_test_result" parameter for generate_question.py implies a dependency on pre-generated queries for certain selection strategies.
Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 43 stars in the last 90 days
