sqlcoder by defog-ai

LLM for natural language to SQL conversion

Created 2 years ago

3,980 stars

Top 12.2% on SourcePulse

5 Experts Love This Project

Dharmesh Shah

Cofounder of HubSpot

Binyuan Hui

Research Scientist at Alibaba Qwen

Shyamal Anadkat

Research Scientist at OpenAI

Pawel Garbacki

Cofounder of Fireworks AI

and 1 more!

Project Summary

SQLCoder is a family of state-of-the-art Large Language Models (LLMs) that convert natural language questions into SQL queries. It targets developers and data analysts who want to query databases in natural language, and it outperforms leading proprietary models such as GPT-4 on the sql-eval framework.

How It Works

SQLCoder models are trained on over 20,000 human-curated natural language-to-SQL query pairs across 10 diverse database schemas. The models are fine-tuned to excel at generating accurate SQL from natural language prompts, with a focus on various SQL constructs like JOIN, WHERE, and GROUP BY clauses.
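To make the input side concrete, here is a minimal sketch of how a question and a schema might be combined into one prompt for a text-to-SQL model. The `build_prompt` helper and the section headings it emits are assumptions made for illustration; they are not confirmed to be the template SQLCoder was trained with, so check the model card before relying on any specific layout.

```python
def build_prompt(question: str, schema_ddl: str) -> str:
    """Combine a question and a schema into a single text-to-SQL prompt.

    The section layout is illustrative only (hypothetical template).
    """
    return (
        "### Task\n"
        f"Generate a SQL query to answer this question: {question.strip()}\n\n"
        "### Database Schema\n"
        f"{schema_ddl.strip()}\n\n"
        "### SQL\n"
    )


# Hypothetical two-table schema used only for this example.
schema = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
"""

prompt = build_prompt("What is the total order value per country?", schema)
print(prompt)
```

Passing the schema as `CREATE TABLE` statements gives the model the table names, column names, and types it needs to produce JOIN, WHERE, and GROUP BY clauses against the right identifiers.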

Quick Start & Requirements

NVIDIA GPU (>16GB VRAM): pip install "sqlcoder[transformers]"
Apple Silicon: CMAKE_ARGS="-DLLAMA_METAL=on" pip install "sqlcoder[llama-cpp]"
Linux/Intel Mac (CPU): CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install "sqlcoder[llama-cpp]"
Windows (CPU, PowerShell): $env:CMAKE_ARGS = "-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"; pip install "sqlcoder[llama-cpp]"
Hardware: Tested with 4xA10 GPUs using float16 weights. Quantized versions (8-bit, 4-bit) run on consumer GPUs (e.g., RTX 4090/3090) or Apple Silicon machines with at least 20GB of memory.
Demo: Interactive Demo
Colab: ♾️ Colab
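The hardware guidance follows from simple arithmetic: the memory needed for the weights is roughly the parameter count times the bytes per parameter. The sketch below applies that rule to the model sizes listed under Highlighted Details (7B, 34B, 70B). It is a rough estimate that counts weights only and ignores activations, the KV cache, and framework overhead, so real requirements are higher. The function name is hypothetical.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float16": 2.0, "8-bit": 1.0, "4-bit": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


for size in (7, 34, 70):
    row = ", ".join(
        f"{precision}: {weight_memory_gb(size, precision):.1f} GB"
        for precision in BYTES_PER_PARAM
    )
    print(f"{size}B -> {row}")
```

By this estimate a 34B model needs about 68 GB at float16, which does not fit on a single consumer GPU, while its 4-bit weights need about 17 GB, which is consistent with the 20GB guidance above.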

Highlighted Details

Outperforms GPT-4 and GPT-4 Turbo on the sql-eval framework for NL-to-SQL tasks.
Achieves high accuracy across categories including date, group_by, order_by, ratio, join, and where.
Trained on a dataset with schemas distinct from evaluation data to ensure generalization.
Offers multiple model sizes (e.g., 7B, 34B, 70B) for different performance and hardware requirements.
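Accuracy figures for NL-to-SQL are commonly obtained by running the generated query and a reference query against the same database and comparing the returned rows. The sketch below shows that idea with Python's built-in `sqlite3` module. It is a simplified illustration, not the sql-eval implementation: the `same_result` helper, the table, and the queries are all hypothetical.

```python
import sqlite3


def same_result(conn: sqlite3.Connection, generated_sql: str, reference_sql: str) -> bool:
    """Return True if both queries yield the same rows, ignoring row order."""
    try:
        got = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return False  # an invalid generated query counts as a miss
    expected = conn.execute(reference_sql).fetchall()
    # Sort by repr so rows with mixed types compare without errors.
    return sorted(got, key=repr) == sorted(expected, key=repr)


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, country TEXT, total REAL);
INSERT INTO orders (country, total) VALUES ('US', 10.0), ('US', 5.0), ('DE', 7.5);
""")

reference = "SELECT country, SUM(total) FROM orders GROUP BY country"
candidate = (
    "SELECT o.country, SUM(o.total) AS revenue "
    "FROM orders o GROUP BY o.country ORDER BY revenue DESC"
)
wrong = "SELECT country, COUNT(*) FROM orders GROUP BY country"

print(same_result(conn, candidate, reference))
print(same_result(conn, wrong, reference))
```

Comparing results rather than query text lets differently written but equivalent queries count as correct. Ignoring row order is a simplification: for order_by questions the comparison would need to respect ordering.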

Maintenance & Community

Active development by Defog.ai.
Community support via Twitter.

Licensing & Compatibility

Code: Apache-2.0 license.
Model Weights: CC BY-SA 4.0 license. Allows commercial use, but requires modified weights to be open-sourced under the same license.

Limitations & Caveats

Performance on CPUs other than Apple Silicon is less optimized, owing to quantization and the lack of beam search.
Testing on platforms other than Linux/Intel Mac/Windows is limited; contributions are welcomed.
Future work includes Reward Modelling and RLHF tuning.

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Star History

18 stars in the last 30 days