distil-text2sql by distil-labs

Local Text-to-SQL for plain English data querying

Created 3 months ago

276 stars

Top 93.7% on SourcePulse

Project Summary

This project provides a fine-tuned, small language model (SLM) for converting natural language questions into executable SQL queries. It targets users who need to query data locally, ensuring privacy, offline capability, and avoiding cloud dependencies. The key benefit is enabling users to interact with their CSV data using plain English, achieving accuracy comparable to much larger cloud-based LLMs while running efficiently on local hardware.

How It Works

The core approach involves fine-tuning the Qwen3 family of small language models on a dataset of approximately 10,000 synthetic Text2SQL examples. This process specifically trains the model to translate natural language questions and database schemas into correct SQL syntax. The project highlights that off-the-shelf small models struggle with this task, necessitating fine-tuning. The advantage lies in achieving high accuracy (80% LLM-as-a-Judge, 60% Exact Match with the 4B model) with significantly smaller model sizes (4B or 0.6B parameters) compared to large, cloud-hosted models, enabling local execution and enhanced privacy.

Quick Start & Requirements

Install Ollama: Follow instructions on the Ollama website.

Set up Environment:

python -m venv .venv
. .venv/bin/activate
pip install huggingface_hub openai pandas

Download and Build Model:

# Download the recommended 4-bit quantized model (~2.5GB)
huggingface-cli download distil-labs/distil-qwen3-4b-text2sql-gguf-4bit --local-dir distil-model
cd distil-model
ollama create distil-qwen3-4b-text2sql -f Modelfile
cd ..

Run Text2SQL:

python app.py --csv example_data/employees.csv --question "How many employees are in each department?"

Prerequisites: Ollama, Python 3.x, pip packages (huggingface_hub, openai, pandas).
Recommended Model: distil-qwen3-4b-text2sql-gguf-4bit for local use.
Links: Ollama (implied), Hugging Face Hub (for model downloads).

Highlighted Details

The fine-tuned 4B model matches a 685B teacher model on LLM-as-a-Judge accuracy (80%) and exceeds it on Exact Match (60% vs. 48%).
Queries typically return in under 2 seconds on a laptop (M4 MacBook Pro).
A 0.6B model variant is available for edge/mobile deployment, achieving 74% LLM-as-a-Judge accuracy.
The application loads CSV files into an in-memory SQLite database for querying.

Maintenance & Community

No specific details regarding maintainers, community channels (like Discord/Slack), sponsorships, or roadmaps were found in the provided text.

Licensing & Compatibility

The specific open-source license for this project and its models is not explicitly stated in the provided README. Compatibility for commercial use or closed-source linking would require clarification of the underlying model and project licenses.

Limitations & Caveats

The model achieves approximately 80% accuracy, meaning roughly 1 in 5 generated SQL queries may require manual review or adjustment. Users are advised to always use the --show-sql flag to inspect generated queries before execution. The model generates SQLite-compatible SQL, and integration with other database systems may require manual adaptation of the SQL syntax.

Health Check

Last Commit

2 months ago

Responsiveness

Inactive

Pull Requests (30d)

Issues (30d)

Star History

9 stars in the last 30 days