Text-to-SQL models and dataset for cross-domain applications
OmniSQL is a family of text-to-SQL models paired with a large-scale synthetic dataset, SynSQL-2.5M, built to improve the accuracy and robustness of natural-language-to-SQL translation. It targets researchers and practitioners in NLP and database querying, and reports state-of-the-art performance on several text-to-SQL benchmarks.
How It Works
OmniSQL leverages a data synthesis framework that generates over 2.5 million text-to-SQL samples, including chain-of-thought solutions, across more than 16,000 synthetic databases. This framework uses open-source LLMs to create diverse data covering complex SQL queries and varied linguistic styles. The OmniSQL models (7B, 14B, 32B) are then fine-tuned on this synthetic data, augmented with human-labeled datasets like Spider and BIRD, to achieve high accuracy without additional post-processing steps.
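To make the shape of such a sample concrete, here is a minimal Python sketch. The `SyntheticSample` record, its field names, the prompt layout, and the executability check are all assumptions made for illustration; they are not the actual SynSQL-2.5M schema or the OmniSQL prompt format. The sketch only shows how a schema, a question, a chain-of-thought solution, and a SQLite query might travel together, and how a synthesized query could be checked by running it on an empty in-memory database.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class SyntheticSample:
    """Hypothetical record layout; field names are illustrative only."""
    schema_ddl: str        # CREATE TABLE statements for the synthetic database
    question: str          # natural-language question
    chain_of_thought: str  # reasoning text that leads to the final query
    sql: str               # target query in the SQLite dialect


def build_prompt(sample):
    """Assemble a plain schema-plus-question prompt (an assumed format)."""
    return (
        f"Database schema:\n{sample.schema_ddl}\n\n"
        f"Question: {sample.question}\nSQL:"
    )


def is_executable(sample):
    """Return True if the SQL runs on an empty in-memory copy of the schema."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(sample.schema_ddl)
        conn.execute(sample.sql).fetchall()
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


sample = SyntheticSample(
    schema_ddl="CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary REAL);",
    question="What is the average salary of all employees?",
    chain_of_thought="Salaries live in employees.salary, so AVG over that column answers it.",
    sql="SELECT AVG(salary) FROM employees;",
)
```

Calling `is_executable(sample)` gives a cheap sanity filter of the kind a synthesis pipeline could apply before keeping a sample; it says nothing about whether the query actually answers the question.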
Quick Start & Requirements
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The dataset and models focus primarily on English questions and the SQLite dialect, so performance may degrade on other natural languages or other SQL dialects. However, the data synthesis framework can be used to generate data for those scenarios.
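The dialect limitation can be illustrated with a small Python sketch. The helper name `split_by_sqlite_support` and the example queries are hypothetical and are not part of OmniSQL. The sketch asks SQLite to compile each query with `EXPLAIN` (which prepares the statement without running it) and separates the queries SQLite accepts from those written in another dialect's syntax.

```python
import sqlite3


def split_by_sqlite_support(schema_ddl, queries):
    """Partition queries into (accepted, rejected) by whether SQLite can compile them."""
    conn = sqlite3.connect(":memory:")
    accepted, rejected = [], []
    try:
        conn.executescript(schema_ddl)
        for query in queries:
            try:
                # EXPLAIN compiles the statement without executing it.
                conn.execute("EXPLAIN " + query)
                accepted.append(query)
            except sqlite3.Error:
                rejected.append(query)
    finally:
        conn.close()
    return accepted, rejected


schema = "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary REAL);"
sqlite_style = "SELECT name FROM employees ORDER BY salary DESC LIMIT 1"
# Row limiting in the style of SQL Server's TOP clause, which SQLite does not parse.
other_style = "SELECT TOP 1 name FROM employees ORDER BY salary DESC"

accepted, rejected = split_by_sqlite_support(schema, [sqlite_style, other_style])
```

Here `accepted` holds only the `LIMIT` form, which is why a model trained on SQLite-only targets may need dialect-specific data before it is used against another engine.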