DAIL-SQL is a few-shot Text-to-SQL method designed to optimize Large Language Model (LLM) performance on complex database querying tasks. It targets researchers and practitioners in natural language processing and database management, offering a significant boost in accuracy on benchmarks like Spider.
## How It Works
DAIL-SQL enhances Text-to-SQL by encoding structural knowledge as SQL statements, selecting relevant examples based on both question and query similarity, and optimizing token efficiency by excluding cross-domain knowledge. This approach leverages LLMs' in-context learning capabilities by carefully curating prompt content, leading to improved accuracy and reduced computational cost.
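As an illustration of this prompt organization, the sketch below encodes the schema as `CREATE TABLE` statements and appends question–SQL example pairs without their own schemas. The helper names, formatting, and example data are assumptions for illustration, not the repository's actual code:

```python
# Illustrative sketch of a DAIL-SQL-style prompt layout (not the repo's API).

def schema_as_sql(tables):
    """Encode the database schema as CREATE TABLE statements,
    i.e. structural knowledge expressed as SQL."""
    stmts = []
    for table, columns in tables.items():
        cols = ",\n  ".join(columns)
        stmts.append(f"CREATE TABLE {table} (\n  {cols}\n);")
    return "\n".join(stmts)

def build_prompt(tables, examples, question):
    """Assemble a few-shot prompt: target schema first, then
    question-SQL example pairs WITHOUT their own (cross-domain)
    schemas to save tokens, then the target question."""
    parts = [schema_as_sql(tables), ""]
    for q, sql in examples:
        parts += [f"/* {q} */", sql, ""]
    parts += [f"/* {question} */", "SELECT"]
    return "\n".join(parts)

prompt = build_prompt(
    {"singer": ["singer_id int", "name text", "country text"]},
    [("How many singers are there?", "SELECT count(*) FROM singer;")],
    "List the names of singers from France.",
)
print(prompt)
```

Ending the prompt with `SELECT` nudges the model to complete a query rather than produce free-form text.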
## Quick Start & Requirements
- Installation: Requires Python 3.8+ and `conda`. Install dependencies via `pip install -r requirements.txt` after setting up the environment.
- Prerequisites:
  - Stanford CoreNLP server: Download it, then run `java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer &` from `./third_party/stanford-corenlp-full-2018-10-05`.
  - NLTK data: Run `python nltk_downloader.py`.
  - Spider dataset: Download to `./dataset/spider`.
  - OpenAI API key for GPT-4 or GPT-3.5-turbo.
- Setup Time: Moderate, involving dataset download, CoreNLP setup, and environment configuration.
- Links: Stanford CoreNLP, Spider Benchmark
## Highlighted Details
- Achieved 86.6% execution accuracy on the Spider leaderboard using GPT-4 with self-consistency voting.
- Achieves this accuracy efficiently, requiring only ~1,600 tokens per question on Spider-dev.
- Empirically evaluates various prompt engineering strategies, including question representations, example selection, and organization.
- Selects examples considering both question similarity and query similarity for optimal few-shot learning.
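The dual-similarity selection described above can be sketched as follows. This is a simplification: token-overlap (Jaccard) similarity stands in for the embedding distance used in practice, the skeleton extraction is a toy keyword filter, and `pred_sql` represents a preliminary predicted query for the target question:

```python
import re

def mask_question(q):
    # Mask numbers and quoted literals so similarity reflects question intent
    # rather than domain-specific values.
    return re.sub(r"\d+|'[^']*'", "<mask>", q.lower())

def skeleton(sql):
    # Reduce SQL to its keywords so we can compare query structure.
    kws = {"select", "from", "where", "group", "by", "order", "having",
           "join", "limit", "count", "avg", "max", "min", "sum", "distinct"}
    return [t for t in re.findall(r"[a-z]+", sql.lower()) if t in kws]

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def select_examples(target_q, pred_sql, pool, k=2, alpha=0.5):
    """Rank candidate (question, sql) pairs by a blend of question
    similarity and query-skeleton similarity to a preliminary SQL."""
    def score(ex):
        q, sql = ex
        return (alpha * jaccard(mask_question(target_q).split(),
                                mask_question(q).split())
                + (1 - alpha) * jaccard(skeleton(pred_sql), skeleton(sql)))
    return sorted(pool, key=score, reverse=True)[:k]

pool = [
    ("How many singers are there?", "SELECT count(*) FROM singer"),
    ("List all concert names.", "SELECT name FROM concert"),
    ("How many stadiums are there?", "SELECT count(*) FROM stadium"),
]
best = select_examples(
    "How many countries are there?", "SELECT count(*) FROM country", pool, k=1)
print(best)
```

Weighting both signals favors examples whose questions read alike *and* whose queries share structure, which is the key difference from question-only retrieval.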
## Maintenance & Community
- The project is associated with authors from institutions like Shanghai Jiao Tong University and Microsoft.
- Code for schema-linking is inspired by RAT-SQL, and self-consistency voting by C3SQL.
- No explicit community channels (Discord/Slack) or roadmap are mentioned in the README.
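The self-consistency voting credited above amounts to executing several sampled SQL candidates and returning one from the largest group of candidates that agree on their execution result. A minimal sketch using an in-memory SQLite database (the candidate list and schema are illustrative assumptions; in practice the candidates come from LLM sampling):

```python
import sqlite3

def vote(candidates, conn):
    """Execute each candidate SQL; group candidates by execution result
    and return one candidate from the largest group."""
    groups = {}
    for sql in candidates:
        try:
            rows = frozenset(conn.execute(sql).fetchall())
        except sqlite3.Error:
            continue  # invalid SQL never participates in the vote
        groups.setdefault(rows, []).append(sql)
    if not groups:
        return None
    return max(groups.values(), key=len)[0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name text, country text)")
conn.executemany("INSERT INTO singer VALUES (?, ?)",
                 [("A", "France"), ("B", "France"), ("C", "US")])
candidates = [
    "SELECT count(*) FROM singer WHERE country = 'France'",
    "SELECT count(name) FROM singer WHERE country = 'France'",
    "SELECT count(*) FROM singer",   # outvoted: different result
    "SELECT bogus FROM nowhere",     # invalid: skipped
]
print(vote(candidates, conn))
```

Grouping by execution result rather than by SQL text lets syntactically different but semantically equivalent queries reinforce each other.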
## Licensing & Compatibility
- The README does not explicitly state a license. The project is presented as open-source, but users should verify licensing for commercial or closed-source use.
## Limitations & Caveats
- Requires a running Stanford CoreNLP server, adding an external dependency.
- Relies on OpenAI's proprietary models (GPT-4, GPT-3.5-turbo), necessitating API access and associated costs.
- The `pre_test_result` parameter for `generate_question.py` implies a dependency on pre-generated queries for certain example-selection strategies.