XiYan-SQL by XGenerationLab

Framework for text-to-SQL generation using LLMs

created 8 months ago
752 stars

Top 47.2% on sourcepulse

Project Summary

XiYan-SQL is a framework designed to improve the accuracy and robustness of converting natural language queries into SQL statements. It targets researchers and developers working on text-to-SQL tasks, offering state-of-the-art performance through an ensemble of specialized models and advanced schema representation.

How It Works

The framework employs a multi-generator ensemble strategy: several SQL generation models each propose queries, yielding a diverse set of candidates. To help the models understand database structure, it uses "M-Schema," a semi-structured schema representation. Candidates come from two kinds of generators: in-context learning (ICL), with examples selected using named entity recognition, and supervised fine-tuning. A refiner module then corrects syntactic and logical errors, and a dedicated fine-tuned selection model picks the best candidate query.
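The generate, refine, select flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not XiYan-SQL's actual code. The generator functions are hypothetical stubs standing in for LLM calls, the schema string is a plain placeholder rather than real M-Schema output, the refiner simply drops candidates that fail to execute, and the selector uses a majority vote over execution results in place of the fine-tuned selection model.

```python
import sqlite3
from collections import Counter
from typing import Callable, List, Optional, Tuple

# Hypothetical generator signature: (question, schema_text) -> SQL string.
Generator = Callable[[str, str], str]

def gen_icl(question: str, schema: str) -> str:
    # Stub for an ICL-based generator (assumption: a real one calls an LLM).
    return "SELECT COUNT(*) FROM users WHERE age > 30"

def gen_sft(question: str, schema: str) -> str:
    # Stub for a fine-tuned generator.
    return "SELECT COUNT(id) FROM users WHERE age > 30"

def gen_faulty(question: str, schema: str) -> str:
    # Deliberately broken candidate: the column "years" does not exist.
    return "SELECT COUNT(*) FROM users WHERE years > 30"

def execute(conn: sqlite3.Connection, sql: str) -> Optional[Tuple]:
    """Run a query; return its rows as an order-insensitive tuple, or None on error."""
    try:
        return tuple(sorted(conn.execute(sql).fetchall(), key=repr))
    except sqlite3.Error:
        return None

def refine(conn: sqlite3.Connection, sql: str) -> Optional[str]:
    """Placeholder refiner: keeps executable queries and drops the rest.
    A model-based refiner would instead try to repair the failing query."""
    return sql if execute(conn, sql) is not None else None

def select(conn: sqlite3.Connection, candidates: List[str]) -> str:
    """Stand-in selector: pick a candidate whose result is the most common one."""
    results = {sql: execute(conn, sql) for sql in candidates}
    winner, _ = Counter(results.values()).most_common(1)[0]
    return next(sql for sql in candidates if results[sql] == winner)

def pipeline(conn: sqlite3.Connection, question: str, schema: str,
             generators: List[Generator]) -> str:
    candidates = [g(question, schema) for g in generators]
    refined = [r for r in (refine(conn, c) for c in candidates) if r is not None]
    if not refined:
        raise ValueError("no executable candidate was produced")
    return select(conn, refined)

# Toy database for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)",
                 [(1, "Ann", 34), (2, "Bob", 28), (3, "Cy", 41)])

schema_text = "users(id INTEGER, name TEXT, age INTEGER)"  # placeholder, not M-Schema
best = pipeline(conn, "How many users are older than 30?", schema_text,
                [gen_icl, gen_sft, gen_faulty])
print(best)
```

Voting on execution results is only one simple way to choose among candidates; it is used here because it needs no trained model, whereas the framework described above relies on a dedicated selection model.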

Quick Start & Requirements

  • Models are available on HuggingFace and ModelScope.
  • The project supports local deployment via XiYan-MCP-server for high-security data access.
  • Specific hardware requirements are not detailed, but model sizes range from 3B to 32B parameters, suggesting significant computational resources may be needed for larger models.
  • Links: 🤗 XiYan GBI, 💻 M-Schema, 📖 Arxiv, PapersWithCode.
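Since hardware requirements are not stated, a back-of-the-envelope estimate of weight memory can help with planning. The sketch below assumes the nominal sizes listed on this page (3B, 7B, 14B, 32B) and common bytes-per-parameter figures for each precision. It counts weights only, so activations, KV cache, and runtime overhead would add to these numbers, and exact parameter counts may differ from the nominal sizes.

```python
# Assumed bytes per parameter for common precisions (weights only).
BYTES_PER_PARAM = {"fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}

# Nominal model sizes in billions of parameters.
MODEL_SIZES_B = [3, 7, 14, 32]

def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (decimal): 1e9 params * bytes / 1e9."""
    return params_billion * bytes_per_param

for size in MODEL_SIZES_B:
    row = ", ".join(
        f"{name}: {weight_memory_gb(size, b):.1f} GB"
        for name, b in BYTES_PER_PARAM.items()
    )
    print(f"{size}B -> {row}")
```

Under these assumptions, the 32B model needs about 64 GB for 16-bit weights alone, which is more than a typical single consumer GPU offers.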

Highlighted Details

  • Achieved SOTA performance on the BIRD leaderboard with an EX score of 75.63% and R-VES of 71.41%.
  • XiYanSQL-QwenCoder-32B model achieved SOTA on the BIRD test set with an EX score of 69.03% as a single fine-tuned model.
  • Framework includes components for database description generation and a DateResolver model for enhanced date understanding, particularly for Chinese queries.
  • Offers multiple model sizes (3B, 7B, 14B, 32B) to cater to different developer needs.
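For readers unfamiliar with the EX score quoted above, the following is a simplified sketch of execution accuracy, assuming EX means the share of predicted queries whose execution result matches that of the gold query. The function names and toy data are illustrative. The official benchmark evaluation differs in details such as timeouts and per-database handling, and R-VES is not covered here.

```python
import sqlite3
from typing import List, Tuple

def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """True if both queries return the same set of rows; a failing prediction counts as a miss."""
    try:
        predicted = set(conn.execute(predicted_sql).fetchall())
    except sqlite3.Error:
        return False
    gold = set(conn.execute(gold_sql).fetchall())
    return predicted == gold

def execution_accuracy(conn: sqlite3.Connection, pairs: List[Tuple[str, str]]) -> float:
    """Percentage of (predicted, gold) pairs whose execution results match."""
    hits = sum(execution_match(conn, p, g) for p, g in pairs)
    return 100.0 * hits / len(pairs)

# Toy database and query pairs for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10), (2, 25), (3, 40)])

pairs = [
    ("SELECT COUNT(*) FROM orders", "SELECT COUNT(id) FROM orders"),      # match
    ("SELECT id FROM orders WHERE amount > 20",
     "SELECT id FROM orders WHERE amount >= 25"),                         # match
    ("SELECT MAX(amount) FROM orders", "SELECT MIN(amount) FROM orders"), # mismatch
    ("SELECT total FROM orders", "SELECT amount FROM orders"),            # prediction fails
]
score = execution_accuracy(conn, pairs)
print(f"EX = {score:.2f}%")
```

Because the comparison is on results rather than query text, two differently written queries count as equivalent whenever they return the same rows on the test database.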

Maintenance & Community

  • The project is actively developed, with frequent updates and releases noted in the README (e.g., new model versions, local server support).
  • A DingTalk group is available for community interaction: 94725009401.
  • The project welcomes contributions and feedback.

Licensing & Compatibility

  • The README does not explicitly state a license for the code or models. The project is open-sourced and its models are published on HuggingFace and ModelScope, but availability alone does not establish license terms, so verify them before any commercial use.

Limitations & Caveats

  • Some components, like the ensemble selection model and MoMQ (multi-dialect Text-to-SQL MoE model), are marked as "to release soon."
  • The DateResolver model is noted as being "major for Chinese," implying potential limitations for other languages.
  • The SOTA claims rest primarily on leaderboard scores rather than on direct, code-level comparisons with other systems.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 2

Star History

  • 161 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Andreas Jansson (Cofounder of Replicate).

natural-sql by cfahlgren1

0.1%
866
Text-to-SQL LLMs with strong performance
created 1 year ago
updated 1 year ago