OmniSQL by RUCKBReasoning

Text-to-SQL models and dataset for cross-domain applications

created 5 months ago
307 stars

Project Summary

OmniSQL is a family of text-to-SQL models and a large-scale dataset (SynSQL-2.5M) designed to improve the accuracy and robustness of natural-language-to-SQL translation. It targets researchers and practitioners working on NLP and database querying, and reports state-of-the-art performance on several benchmarks.

How It Works

OmniSQL leverages a data synthesis framework that generates over 2.5 million text-to-SQL samples, including chain-of-thought solutions, across more than 16,000 synthetic databases. This framework uses open-source LLMs to create diverse data covering complex SQL queries and varied linguistic styles. The OmniSQL models (7B, 14B, 32B) are then fine-tuned on this synthetic data, augmented with human-labeled datasets like Spider and BIRD, to achieve high accuracy without additional post-processing steps.
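To make the shape of such a sample concrete, here is a minimal Python sketch of one record and how it might be turned into a fine-tuning pair. The class name `SynthSample`, its field names, and the prompt layout are assumptions made for illustration; they are not the actual SynSQL-2.5M format.

```python
from dataclasses import dataclass


@dataclass
class SynthSample:
    """One hypothetical text-to-SQL sample (field names are assumed)."""
    schema: str    # CREATE TABLE statements of the synthetic database
    question: str  # natural-language question
    cot: str       # chain-of-thought solution text
    sql: str       # final SQL query in the SQLite dialect


def to_training_pair(sample: SynthSample) -> tuple[str, str]:
    """Turn a sample into an (input, target) pair for supervised fine-tuning."""
    prompt = (
        "Database schema:\n"
        f"{sample.schema}\n\n"
        "Question:\n"
        f"{sample.question}\n"
    )
    # The target holds the reasoning first, then the final query.
    target = f"{sample.cot}\n\nSQL:\n{sample.sql}"
    return prompt, target


example = SynthSample(
    schema="CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);",
    question="How many singers are older than 30?",
    cot="Count the rows in singer where age is greater than 30.",
    sql="SELECT COUNT(*) FROM singer WHERE age > 30;",
)
prompt, target = to_training_pair(example)
```

Placing the reasoning before the query in the target is one plausible way to use chain-of-thought data; the project may format its samples differently.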

Quick Start & Requirements

Highlighted Details

  • SynSQL-2.5M is presented as the largest and most diverse synthetic text-to-SQL dataset to date, with over 2.5 million samples.
  • OmniSQL models outperform comparable LLMs and even GPT-4o/DeepSeek-V3 on several benchmarks.
  • Achieves high accuracy without schema linking, SQL revision, or SQL selection components.
  • Supports SQLite dialect and includes chain-of-thought reasoning for all samples.
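Because the output targets the SQLite dialect, a generated query can be checked by running it with Python's standard `sqlite3` module. The sketch below reads the schema text that could go into a prompt and then executes a query string. The table, its rows, and `generated_sql` (a hard-coded stand-in for model output) are illustrative assumptions; no model is called here.

```python
import sqlite3


def read_schema(conn: sqlite3.Connection) -> str:
    """Collect the CREATE TABLE statements that describe the database."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND sql IS NOT NULL"
    ).fetchall()
    return "\n".join(row[0] for row in rows)


def run_query(conn: sqlite3.Connection, sql: str):
    """Execute a query; return (rows, None) on success or (None, error text)."""
    try:
        return conn.execute(sql).fetchall(), None
    except sqlite3.Error as exc:
        return None, str(exc)


# Small in-memory database used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer (name, age) VALUES (?, ?)",
    [("Ana", 25), ("Ben", 34), ("Chloe", 41)],
)

schema_text = read_schema(conn)  # would be placed in the model prompt
generated_sql = "SELECT COUNT(*) FROM singer WHERE age > 30"  # stand-in for model output
rows, error = run_query(conn, generated_sql)
```

Returning the error text instead of raising keeps the check usable in a batch loop, where a malformed query should be recorded as a failure rather than stop the run.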

Maintenance & Community

  • Active development with recent updates (March 2025) including training/evaluation scripts and data synthesis framework code.
  • Contact: Haoyang Li (lihaoyang.cs@ruc.edu.cn) or GitHub Issues.

Licensing & Compatibility

  • SynSQL-2.5M dataset is released under Apache 2.0.
  • Model licenses are not explicitly stated; confirm the terms before relying on the models, especially outside research use.
  • Compatible with SQLite.

Limitations & Caveats

The dataset and models focus primarily on English and the SQLite dialect, which may limit performance on other languages or other SQL dialects. However, the data synthesis framework can be used to generate data for those scenarios.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 2
  • Star history: 81 stars in the last 90 days
