NSQL by NumbersStationAI

Open-source Text-to-SQL foundation models for efficient database interaction

Created 2 years ago

254 stars

Top 99.1% on SourcePulse

1 Expert Loves This Project

Jeff Hammerbacher

Cofounder of Cloudera

Project Summary

Numbers Station AI's NSQL is a family of open-source, autoregressive large foundation models specifically engineered for Text-to-SQL tasks. It addresses the need for accurate and efficient conversion of natural language queries into executable SQL statements, targeting engineers, researchers, and power users working with relational databases. The models offer a range of sizes, from 350M to 7B parameters, providing flexibility for deployment scenarios, including local execution with enhanced privacy.

How It Works

NSQL models use an autoregressive architecture, the standard approach for sequence generation, but are specialized for SQL. According to the project, this focus yields strong results on complex query structures such as joins and nested subqueries, often ahead of larger general-purpose models. The models are designed to read a database schema and translate the user's intent into syntactically correct, semantically accurate SQL.
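
As an illustration of how a schema and a question can be combined into one input, here is a minimal sketch that reads the CREATE TABLE statements from a SQLite database and places the question after them. The prompt layout (schema, an instruction comment, the question as a comment, then a trailing SELECT for the model to complete) is modeled on the format shown on the NSQL model cards and should be treated as an assumption; check the model card of the specific checkpoint before relying on it. The function name build_prompt and the sample table are hypothetical.

```python
import sqlite3


def build_prompt(conn: sqlite3.Connection, question: str) -> str:
    """Assemble a schema-plus-question prompt for a text-to-SQL model."""
    # sqlite_master keeps the original CREATE TABLE text of every table.
    rows = conn.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type = 'table' AND sql IS NOT NULL ORDER BY name"
    ).fetchall()
    schema = "\n\n".join(row[0].strip() for row in rows)
    # Assumed layout: schema, instruction, question, then an open SELECT
    # that the autoregressive model is expected to continue.
    return (
        f"{schema}\n\n"
        "-- Using valid SQLite, answer the following questions "
        "for the tables provided above.\n\n"
        f"-- {question}\n"
        "SELECT"
    )


# Hypothetical example database, created in memory for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
prompt = build_prompt(conn, "How many singers are older than 30?")
print(prompt)
```

Ending the prompt with SELECT nudges the model to continue with a query rather than free text, which is why the completion has to be prefixed with SELECT again before it is executed.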

Quick Start & Requirements

Install: Run pip install -r requirements.txt.
Prerequisites: Python environment, manifest library, and database connectors (examples provided for Postgres and SQLite). Model weights are available on HuggingFace.
Usage: Start a local API server using python3 -m manifest.api.app ... and then interact with it via a Python client, as demonstrated in the examples/ directory.
Links: Model weights on HuggingFace.
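
The client side of this workflow can be sketched roughly as follows: send a prompt to the model, turn the completion back into a full query, and run it through a database connector. This is not the code from the examples/ directory. The generate callable is a hypothetical stand-in for whatever call reaches the local API server, and it is stubbed here with a fixed string so the sketch runs without a model; the table and data are also invented for the demonstration.

```python
import sqlite3
from typing import Callable, List, Tuple


def answer_question(
    conn: sqlite3.Connection,
    prompt: str,
    generate: Callable[[str], str],
) -> Tuple[str, List[tuple]]:
    """Complete the prompt with the model, then execute the resulting query."""
    # Assumption: the prompt ends with "SELECT", so the completion is the
    # remainder of the query and the keyword must be added back.
    completion = generate(prompt)
    sql = "SELECT " + completion.strip().rstrip(";")
    # Crude guard against stacked statements. It also rejects semicolons
    # inside string literals, which is acceptable for a sketch.
    if ";" in sql:
        raise ValueError("expected a single SQL statement")
    return sql, conn.execute(sql).fetchall()


# Hypothetical stub standing in for the model call.
def fake_generate(prompt: str) -> str:
    return " COUNT(*) FROM singer WHERE age > 30;"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer (name, age) VALUES (?, ?)",
    [("Ann", 25), ("Bo", 41), ("Cy", 35)],
)
sql, rows = answer_question(conn, "-- How many singers are older than 30?\nSELECT", fake_generate)
print(sql)
print(rows)
```

In a real setup, opening the database connection in read-only mode is a safer guard than string checks, since generated SQL should not be trusted to be free of side effects.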

Highlighted Details

The NSQL-llama-2-7B model achieves near-parity with GPT-4 on the Spider benchmark's overall execution accuracy (78.1% vs. 76.2%) while being approximately 250 times smaller.
It is reported to outperform GPT-4 substantially on complex queries: +43% on Join queries and +54% on Nested queries.
NSQL models demonstrate superior Matching Accuracy, indicating structurally correct SQL generation.
The availability of smaller models (e.g., 350M, 2B, 6B) enables local deployment, ensuring data privacy and reducing reliance on external APIs.
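
The two kinds of metric mentioned above measure different things, and a small sketch may make the distinction concrete. Execution accuracy asks whether the predicted query returns the same rows as the reference query, while matching accuracy asks whether the query itself has the same form. The code below is a deliberately simplified approximation: the official Spider evaluation compares parsed query components rather than normalized strings, and all function names and data here are hypothetical.

```python
import sqlite3


def normalize(sql: str) -> str:
    """Lowercase, drop a trailing semicolon, and collapse whitespace."""
    return " ".join(sql.strip().rstrip(";").lower().split())


def exact_match(pred: str, gold: str) -> bool:
    # Simplified stand-in for matching accuracy: compare the query text.
    return normalize(pred) == normalize(gold)


def execution_match(conn: sqlite3.Connection, pred: str, gold: str) -> bool:
    # Execution accuracy: compare result sets, ignoring row order.
    try:
        pred_rows = conn.execute(pred).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to run counts as wrong
    gold_rows = conn.execute(gold).fetchall()
    return sorted(pred_rows, key=repr) == sorted(gold_rows, key=repr)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer (name, age) VALUES (?, ?)",
    [("Ann", 25), ("Bo", 41), ("Cy", 35)],
)

gold = "SELECT name FROM singer WHERE age > 30"
pred = "SELECT name FROM singer WHERE NOT age <= 30;"

same_text = exact_match(pred, gold)
same_rows = execution_match(conn, pred, gold)
print(same_text, same_rows)
```

Here the predicted query is written differently from the reference but returns the same rows, so it passes the execution check and fails the text comparison. That gap is why the two figures are reported separately.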

Maintenance & Community

The project lists Vishal Motwani, Sen Wu, and Laurel Orr as contributors. No specific community channels (like Discord or Slack) or roadmap details are provided in the README.

Licensing & Compatibility

The code in this repository is licensed under the permissive Apache 2.0 license, which generally allows for commercial use. However, the datasets used for training NSQL models have diverse licenses, including CC-BY-4.0, MIT, Apache-2.0, BSD 3-Clause, and others. Users must adhere to the terms of these original dataset licenses, including any attribution requirements, which may impose restrictions on derived works or redistribution.

Limitations & Caveats

The primary caveat for adoption is the varied licensing of the training data; users must carefully review and comply with the original licenses of each dataset used in the NSText2SQL corpus. The README does not specify any known bugs, unsupported platforms, or deprecation plans.

Health Check

Last Commit

3 months ago

Responsiveness

Inactive

Star History

0 stars in the last 30 days