XiYan-SQL is a framework designed to improve the accuracy and robustness of converting natural language queries into SQL statements. It targets researchers and developers working on text-to-SQL tasks, offering state-of-the-art performance through an ensemble of specialized models and advanced schema representation.
How It Works
The framework employs a multi-generator ensemble strategy, combining multiple SQL generation models to produce a diverse set of candidate queries. It uses "M-Schema," a semi-structured schema representation, to improve the model's understanding of database structures. To produce high-quality, diverse SQL candidates, XiYan-SQL combines in-context learning (ICL), with examples selected via named entity recognition, and supervised fine-tuning. A refiner module then corrects syntactical and logical errors, and a dedicated selection model is fine-tuned to pick the best candidate query.
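A minimal sketch of that generate/refine/select flow is shown below, assuming hypothetical `Generator`, `refine_sql`, and `score_candidate` callables; these names are illustrative and not taken from the XiYan-SQL codebase.

```python
from typing import Callable, List

# Illustrative signature: (question, m_schema) -> candidate SQL string.
Generator = Callable[[str, str], str]

def text_to_sql(question: str, m_schema: str,
                generators: List[Generator],
                refine_sql: Callable[[str, str], str],
                score_candidate: Callable[[str, str, str], float]) -> str:
    """Sketch of a XiYan-SQL-style pipeline: many generators, one refiner, one selector."""
    # 1. Each ICL or fine-tuned generator proposes a candidate query from the
    #    question and the M-Schema representation of the database.
    candidates = [gen(question, m_schema) for gen in generators]
    # 2. The refiner repairs syntactic and logical errors in every candidate.
    refined = [refine_sql(sql, m_schema) for sql in candidates]
    # 3. A fine-tuned selection model scores each candidate; the top one is returned.
    return max(refined, key=lambda sql: score_candidate(question, m_schema, sql))
```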
Quick Start & Requirements
- Models are available on HuggingFace and ModelScope; a loading sketch follows this list.
- The project supports local deployment via XiYan-MCP-server for high-security data access.
- Specific hardware requirements are not detailed, but model sizes range from 3B to 32B parameters, suggesting significant computational resources may be needed for larger models.
- Links: 🤗 XiYan GBI, 💻 M-Schema, 📖 Arxiv, PapersWithCode.
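
As a quick-start illustration, the fine-tuned generators can be loaded like any HuggingFace causal LM. The model ID and the M-Schema-style prompt layout below are assumptions for illustration only; check the project README and the M-Schema repository for the exact checkpoint identifiers and prompt template.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model ID -- verify the exact 3B/7B/14B/32B checkpoint names
# on HuggingFace or ModelScope before use.
MODEL_ID = "XGenerationLab/XiYanSQL-QwenCoder-7B-2502"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Approximation of an M-Schema-style prompt; the authoritative template is
# defined in the M-Schema repository.
prompt = """You are a SQL expert. Given the schema, answer the question with one SQLite query.

【Schema】
# Table: singer
[(singer_id:INTEGER, Primary Key), (name:TEXT), (country:TEXT), (age:INTEGER)]

Question: How many singers are from France?
SQL:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```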
Highlighted Details
- Achieved SOTA performance on the Bird leaderboard with an EX score of 75.63% and R-VES of 71.41%.
- XiYanSQL-QwenCoder-32B model achieved SOTA on the Bird test set with an EX score of 69.03% as a single fine-tuned model.
- Framework includes components for database description generation and a DateResolver model for enhanced date understanding, particularly for Chinese queries (a toy date-resolution sketch follows this list).
- Offers multiple model sizes (3B, 7B, 14B, 32B) to cater to different developer needs.
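
As a rough illustration of what a date-resolution pre-processing step does, the toy function below rewrites a few relative English phrases into absolute dates before SQL generation. It is a hypothetical stand-in, not the project's DateResolver, which targets Chinese expressions.

```python
from datetime import date, timedelta

def resolve_relative_dates(question: str, today: date) -> str:
    """Toy stand-in for a DateResolver: replace relative date phrases with
    absolute ISO dates so the SQL generator sees concrete literals."""
    replacements = {
        "yesterday": (today - timedelta(days=1)).isoformat(),
        "last week": f"since {(today - timedelta(days=7)).isoformat()}",
        "today": today.isoformat(),
    }
    for phrase, resolved in replacements.items():
        question = question.replace(phrase, resolved)
    return question

print(resolve_relative_dates("How many orders were placed yesterday?", date(2025, 1, 15)))
# -> "How many orders were placed 2025-01-14?"
```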
Maintenance & Community
- The project is actively developed, with frequent updates and releases noted in the README (e.g., new model versions, local server support).
- A DingTalk group is available for community interaction: 94725009401.
- The project welcomes contributions and feedback.
Licensing & Compatibility
- The README does not explicitly state a license for the code or models. The project is open-source and the models are distributed on HuggingFace and ModelScope, which suggests permissive usage, but the terms for commercial use should be verified.
Limitations & Caveats
- Some components, like the ensemble selection model and MoMQ (multi-dialect Text-to-SQL MoE model), are marked as "to release soon."
- The DateResolver model is described as primarily targeting Chinese, implying potential limitations for other languages.
- SOTA claims rest primarily on public leaderboard scores rather than direct, reproducible comparisons using the released code.