sqlcoder  by defog-ai

LLM for natural language to SQL conversion

Created 2 years ago
3,932 stars

Top 12.4% on SourcePulse

Project Summary

SQLCoder is a family of Large Language Models (LLMs) that convert natural language questions into SQL queries. It targets developers and data analysts who want to query databases in natural language, and it reports better results than proprietary models such as GPT-4 and GPT-4 Turbo on the sql-eval framework.

How It Works

SQLCoder models are trained on over 20,000 human-curated natural language-to-SQL query pairs across 10 diverse database schemas. The models are fine-tuned to excel at generating accurate SQL from natural language prompts, with a focus on various SQL constructs like JOIN, WHERE, and GROUP BY clauses.
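
To make the input side concrete, here is a minimal sketch of how a question and a database schema might be combined into a single prompt for a text-to-SQL model. The template wording and the `build_prompt` helper are illustrative assumptions, not the prompt format SQLCoder was trained with; the project repository is the place to check for the real one.

```python
# Hypothetical prompt builder. The template below is an assumption made for
# illustration; it is NOT the official SQLCoder prompt format.
PROMPT_TEMPLATE = """### Task
Generate a SQL query that answers the question: {question}

### Database Schema
{schema}

### SQL
"""


def build_prompt(question: str, schema: str) -> str:
    """Combine a natural language question and schema DDL into one prompt string."""
    return PROMPT_TEMPLATE.format(question=question.strip(), schema=schema.strip())


if __name__ == "__main__":
    # Toy schema invented for this example.
    schema = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
"""
    print(build_prompt("What is the total order value per customer?", schema))
```

Passing the schema as `CREATE TABLE` statements gives the model the table and column names it needs for constructs such as JOIN and GROUP BY.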

Quick Start & Requirements

  • NVIDIA GPU (>16GB VRAM): pip install "sqlcoder[transformers]"
  • Apple Silicon: CMAKE_ARGS="-DLLAMA_METAL=on" pip install "sqlcoder[llama-cpp]"
  • Linux/Intel Mac (CPU): CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install "sqlcoder[llama-cpp]"
  • Windows (CPU, PowerShell): $env:CMAKE_ARGS = "-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"; pip install "sqlcoder[llama-cpp]"
  • Hardware: Tested on 4xA10 GPUs with float16 weights. Quantized versions (8-bit, 4-bit) run on consumer GPUs (e.g., RTX 4090/3090) or on Apple Silicon with 20GB or more of memory.
  • Demo: Interactive Demo
  • Colab: Colab notebook
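
The platform-specific install commands above can be summarized as a small lookup. The sketch below is only an illustration: the `pick_install_command` helper and its dictionary keys are hypothetical, GPU detection is left to the caller, and the Windows entry separates the two PowerShell statements with a semicolon.

```python
import platform

# Install commands taken from the Quick Start list; the dictionary keys are
# hypothetical labels chosen for this example.
INSTALL_COMMANDS = {
    "nvidia": 'pip install "sqlcoder[transformers]"',
    "apple_silicon": 'CMAKE_ARGS="-DLLAMA_METAL=on" pip install "sqlcoder[llama-cpp]"',
    "cpu_unix": (
        'CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" '
        'pip install "sqlcoder[llama-cpp]"'
    ),
    "cpu_windows": (
        '$env:CMAKE_ARGS = "-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"; '
        'pip install "sqlcoder[llama-cpp]"'
    ),
}


def pick_install_command(system: str, machine: str, has_nvidia_gpu: bool) -> str:
    """Return the install command matching the given platform description.

    `system` and `machine` follow the values returned by platform.system()
    and platform.machine(). Whether a suitable NVIDIA GPU is present is an
    input here, because detecting it is outside the scope of this sketch.
    """
    if has_nvidia_gpu:
        return INSTALL_COMMANDS["nvidia"]
    if system == "Darwin" and machine == "arm64":
        return INSTALL_COMMANDS["apple_silicon"]
    if system == "Windows":
        return INSTALL_COMMANDS["cpu_windows"]
    # Linux and Intel Macs share the CPU (OpenBLAS) command.
    return INSTALL_COMMANDS["cpu_unix"]


if __name__ == "__main__":
    print(pick_install_command(platform.system(), platform.machine(), has_nvidia_gpu=False))
```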

Highlighted Details

  • Outperforms GPT-4 and GPT-4 Turbo on the sql-eval framework for NL-to-SQL tasks.
  • Achieves high accuracy across categories including date, group_by, order_by, ratio, join, and where.
  • Trained on a dataset with schemas distinct from evaluation data to ensure generalization.
  • Offers multiple model sizes (e.g., 7B, 34B, 70B) for different performance and hardware requirements.
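
One common way to score NL-to-SQL output is to run the generated query and a reference query against the same database and compare the rows they return. The sketch below illustrates that idea with Python's built-in `sqlite3` module. It is an assumption about the general approach, not the actual sql-eval implementation, and the table, data, and `results_match` helper are made up for illustration.

```python
import sqlite3
from collections import Counter


def results_match(conn: sqlite3.Connection, generated_sql: str, reference_sql: str) -> bool:
    """Return True when both queries yield the same multiset of rows.

    Row order is ignored, so this simplified check cannot verify ORDER BY.
    """
    try:
        generated_rows = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return False  # a generated query that fails to run counts as wrong
    reference_rows = conn.execute(reference_sql).fetchall()
    return Counter(generated_rows) == Counter(reference_rows)


def make_toy_db() -> sqlite3.Connection:
    """Build an in-memory database with invented example data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
        INSERT INTO orders (customer, total)
        VALUES ('ana', 10.0), ('ana', 5.0), ('bo', 7.5);
        """
    )
    return conn


if __name__ == "__main__":
    conn = make_toy_db()
    reference = "SELECT customer, SUM(total) FROM orders GROUP BY customer"
    generated = "SELECT customer, SUM(total) AS s FROM orders GROUP BY customer ORDER BY s DESC"
    print(results_match(conn, generated, reference))
```

Comparing executed results rather than query text lets two differently written but equivalent queries both count as correct.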

Maintenance & Community

  • Developed by Defog.ai; the most recent commit was about a year ago.
  • Community support via Twitter.

Licensing & Compatibility

  • Code: Apache-2.0 license.
  • Model Weights: CC BY-SA 4.0 license. Allows commercial use, but requires modified weights to be open-sourced under the same license.

Limitations & Caveats

  • Results on CPUs other than Apple Silicon are weaker, because those setups run a quantized model without beam search.
  • Testing on platforms other than Linux/Intel Mac/Windows is limited; contributions are welcomed.
  • Future work includes Reward Modelling and RLHF tuning.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 32 stars in the last 30 days
