OmniSQL by RUCKBReasoning

Text-to-SQL models and dataset for cross-domain applications

Created 1 year ago

432 stars

Top 68.8% on SourcePulse

Project Summary

OmniSQL is a family of text-to-SQL models and a large-scale dataset (SynSQL-2.5M) designed to improve the accuracy and robustness of natural language to SQL translation. It targets researchers and practitioners in NLP and database querying, offering state-of-the-art performance on various benchmarks.

How It Works

OmniSQL leverages a data synthesis framework that generates over 2.5 million text-to-SQL samples, including chain-of-thought solutions, across more than 16,000 synthetic databases. This framework uses open-source LLMs to create diverse data covering complex SQL queries and varied linguistic styles. The OmniSQL models (7B, 14B, 32B) are then fine-tuned on this synthetic data, augmented with human-labeled datasets like Spider and BIRD, to achieve high accuracy without additional post-processing steps.

Quick Start & Requirements

Install/Run: Inference can be performed using vLLM or Hugging Face Transformers.
Prerequisites: Python, vLLM (for vLLM inference), PyTorch (for Transformers inference), CUDA-enabled GPU (recommended for performance). Models are available on Modelscope and Hugging Face.
Links:
- Paper: https://arxiv.org/abs/2503.02240
- GitHub: https://github.com/RUCKBReasoning/OmniSQL
- Models/Dataset: Modelscope, HuggingFace

Highlighted Details

SynSQL-2.5M is the largest (2.5M samples) and most diverse synthetic text-to-SQL dataset to date.
OmniSQL models outperform comparable LLMs and even GPT-4o/DeepSeek-V3 on several benchmarks.
Achieves high accuracy without schema linking, SQL revision, or SQL selection components.
Supports SQLite dialect and includes chain-of-thought reasoning for all samples.

Maintenance & Community

Active development with recent updates (March 2025) including training/evaluation scripts and data synthesis framework code.
Contact: Haoyang Li (lihaoyang.cs@ruc.edu.cn) or GitHub Issues.

Licensing & Compatibility

SynSQL-2.5M dataset is released under Apache 2.0.
Model licenses are not explicitly stated but are typically permissive for research use.
Compatible with SQLite.

Limitations & Caveats

The dataset and models are primarily focused on English and the SQLite dialect, potentially limiting performance in multi-language or multi-SQL dialect scenarios. However, the data synthesis framework can be used to generate data for other scenarios.

Health Check

Last Commit

5 months ago

Responsiveness

1 week

Pull Requests (30d)

Issues (30d)

Star History

22 stars in the last 30 days