OmniSQL by RUCKBReasoning

Text-to-SQL models and dataset for cross-domain applications

Created 8 months ago
373 stars

Top 75.9% on SourcePulse

Project Summary

OmniSQL is a family of text-to-SQL models paired with a large-scale dataset (SynSQL-2.5M), designed to improve the accuracy and robustness of natural-language-to-SQL translation. It targets researchers and practitioners in NLP and database querying and reports state-of-the-art performance on several benchmarks.

How It Works

OmniSQL leverages a data synthesis framework that generates over 2.5 million text-to-SQL samples, including chain-of-thought solutions, across more than 16,000 synthetic databases. This framework uses open-source LLMs to create diverse data covering complex SQL queries and varied linguistic styles. The OmniSQL models (7B, 14B, 32B) are then fine-tuned on this synthetic data, augmented with human-labeled datasets like Spider and BIRD, to achieve high accuracy without additional post-processing steps.
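
To make the input side of this pipeline concrete, the following is a minimal sketch of how a question and a SQLite schema might be combined into a single prompt. The prompt layout and the helper names `schema_ddl` and `build_prompt` are assumptions for illustration, not the template the OmniSQL authors use; only Python's standard `sqlite3` module is involved.

```python
import sqlite3


def schema_ddl(conn: sqlite3.Connection) -> str:
    """Return the CREATE TABLE statements SQLite stores in sqlite_master."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type = 'table' AND sql IS NOT NULL ORDER BY name"
    ).fetchall()
    return "\n\n".join(row[0] for row in rows)


def build_prompt(conn: sqlite3.Connection, question: str) -> str:
    # Hypothetical layout: the real OmniSQL prompt template may differ.
    return (
        "Database engine: SQLite\n\n"
        f"Database schema:\n{schema_ddl(conn)}\n\n"
        f"Question:\n{question}\n\n"
        "Think step by step, then give the final SQL query."
    )


# Toy in-memory database standing in for one of the synthetic databases.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")

prompt = build_prompt(conn, "How many singers are older than 30?")
print(prompt)
```

The resulting string would be passed to whichever inference stack hosts the model. That step is omitted here because the summary does not specify one.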

Quick Start & Requirements

Highlighted Details

  • SynSQL-2.5M (2.5M samples) is presented as the largest and most diverse synthetic text-to-SQL dataset to date.
  • OmniSQL models outperform comparable LLMs and even GPT-4o/DeepSeek-V3 on several benchmarks.
  • Achieves high accuracy without schema linking, SQL revision, or SQL selection components.
  • Targets the SQLite dialect; every SynSQL-2.5M sample includes a chain-of-thought solution.
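
Accuracy on text-to-SQL benchmarks such as Spider and BIRD is commonly measured by executing the predicted query and comparing its result with that of the reference query. The sketch below shows that idea on SQLite. The helper names `run` and `execution_match` are hypothetical, and this is not the project's own evaluation script.

```python
import sqlite3
from collections import Counter


def run(conn, sql):
    """Execute a query and return its rows, or None if SQLite rejects it."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None


def execution_match(conn, predicted_sql, gold_sql):
    """True when both queries return the same multiset of rows (order ignored)."""
    predicted, gold = run(conn, predicted_sql), run(conn, gold_sql)
    if predicted is None or gold is None:
        return False
    return Counter(predicted) == Counter(gold)


# Toy database with three rows, two of which have age > 30.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    INSERT INTO singer (name, age) VALUES ('Ana', 41), ('Bo', 25), ('Cy', 33);
""")

gold = "SELECT COUNT(*) FROM singer WHERE age > 30"
same = execution_match(conn, "SELECT COUNT(id) FROM singer WHERE age >= 31", gold)
different = execution_match(conn, "SELECT COUNT(*) FROM singer", gold)
broken = execution_match(conn, "SELEC COUNT(*) FROM singer", gold)
```

Comparing result sets rather than SQL strings lets differently written but equivalent queries count as correct, which matters when no SQL revision or selection step normalizes the output.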

Maintenance & Community

  • Active development with recent updates (March 2025) including training/evaluation scripts and data synthesis framework code.
  • Contact: Haoyang Li (lihaoyang.cs@ruc.edu.cn) or GitHub Issues.

Licensing & Compatibility

  • SynSQL-2.5M dataset is released under Apache 2.0.
  • Model licenses are not explicitly stated, so confirm the terms before relying on the models outside research use.
  • Compatible with SQLite.

Limitations & Caveats

The dataset and models focus primarily on English and the SQLite dialect, which may limit performance in multilingual or multi-dialect scenarios. However, the data synthesis framework can be used to generate data for those scenarios.
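
One practical consequence of the SQLite focus is that SQL written for another engine may not compile at all. The following sketch, using a hypothetical helper `parses_in_sqlite`, checks whether a statement is accepted by SQLite without running it. It relies on SQLite's `EXPLAIN` prefix, which prepares the statement and returns its bytecode instead of executing it.

```python
import sqlite3


def parses_in_sqlite(conn, sql):
    """Return True if SQLite can compile the statement against this schema.

    The check also fails for unknown tables or columns, not only for syntax
    that belongs to another dialect.
    """
    try:
        conn.execute("EXPLAIN " + sql)
        return True
    except sqlite3.Error:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")

sqlite_style = parses_in_sqlite(conn, "SELECT name FROM singer ORDER BY age DESC LIMIT 1")
tsql_style = parses_in_sqlite(conn, "SELECT TOP 1 name FROM singer ORDER BY age DESC")
```

Here the `LIMIT` form compiles, while the `TOP 1` form used by some other engines is rejected as a syntax error.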

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 30 stars in the last 30 days
