This repository provides a framework for enhancing Text-to-SQL capabilities using Large Language Models (LLMs). It offers a comprehensive workflow for data processing, model fine-tuning (SFT), prediction, and evaluation, aiming to reduce training costs and improve accuracy for database querying via natural language. The target audience includes researchers and developers working on Text-to-SQL solutions.
How It Works
DB-GPT-Hub applies Supervised Fine-Tuning (SFT) to a range of LLMs, including CodeLlama, Llama2, and Qwen, using parameter-efficient techniques such as LoRA and QLoRA. It processes Text-to-SQL datasets such as Spider, WikiSQL, and BIRD-SQL, employing an information-matching generation approach that pairs table and schema information with the natural-language question so the model produces the corresponding SQL. The framework supports multiple fine-tuning and prediction methods, with an emphasis on improving accuracy while reducing computational cost.
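To make the schema-plus-question idea concrete, here is a minimal sketch of how one SFT sample could be assembled; the prompt template, field names, and example data are illustrative and are not the repository's exact format.

```python
# Illustrative only: DB-GPT-Hub's actual prompt template and JSON field names
# may differ; this sketches the general "table information + question" pairing.

def build_sft_example(schema_ddl: str, question: str, gold_sql: str) -> dict:
    """Combine table information with a natural-language question to form one
    supervised fine-tuning sample (input prompt + target SQL)."""
    prompt = (
        "Given the database schema, write a SQL query that answers the question.\n\n"
        f"Schema:\n{schema_ddl}\n\n"
        f"Question: {question}\n"
        "SQL:"
    )
    return {"input": prompt, "output": gold_sql}

example = build_sft_example(
    schema_ddl="CREATE TABLE singer (singer_id INT, name TEXT, country TEXT);",
    question="How many singers are from France?",
    gold_sql="SELECT COUNT(*) FROM singer WHERE country = 'France';",
)
print(example["input"])
```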
Quick Start & Requirements
- Install: `pip install dbgpt-hub` (a programmatic usage sketch follows this list)
- Prerequisites: Python 3.10 and Git. Fine-tuning requires substantial GPU memory (e.g., about 6 GB for 7B models and 13.4 GB for 13B models) and adequate disk space.
- Setup: Clone the repository, create a conda environment, and install dependencies. Data preprocessing involves running a shell script.
- Docs: Official Docs
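For pip-based installs, the package also exposes programmatic entry points that mirror the shell-script workflow. The sketch below assumes a module path such as `dbgpt_hub.train.start_sft` and an argument-dict interface; verify both against your installed version and the official docs, and treat the hyperparameter values as placeholders.

```python
# Assumed entry point and argument names -- check them against the installed
# dbgpt_hub version and the official docs; the values below are placeholders.
from dbgpt_hub.train import start_sft

train_args = {
    "model_name_or_path": "codellama/CodeLlama-13b-Instruct-hf",
    "dataset": "example_text2sql_train",
    "finetuning_type": "lora",            # or "qlora" for quantized fine-tuning
    "output_dir": "dbgpt_hub/output/adapter",
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 16,
    "learning_rate": 2e-4,
    "num_train_epochs": 8,
}

# Typical flow: preprocess data, fine-tune, predict on the dev set, evaluate.
# Analogous functions (e.g., start_predict, start_evaluate) are assumed to take
# similar configuration dicts; consult the docs for their exact arguments.
start_sft(train_args)
```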
Highlighted Details
- Supports fine-tuning for Text-to-SQL, Text-to-NLU, and Text-to-GQL.
- Achieved 0.764 execution accuracy on a 1.27 GB database with a fine-tuned 13B model in a zero-shot setting.
- Offers fine-tuning via LoRA and QLoRA, with configurable adapter parameters across supported LLM architectures (see the adapter-configuration sketch after this list).
- Includes scripts for data preprocessing, training, prediction, evaluation, and model weight merging.
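The LoRA/QLoRA parameters referenced above correspond to standard PEFT-style adapter settings. As a rough illustration only (the values and target modules are examples, not the repository's defaults), a comparable configuration expressed with the `peft` library looks like this:

```python
from peft import LoraConfig, TaskType

# Example values only; DB-GPT-Hub exposes equivalent knobs (rank, alpha,
# dropout, target modules) through its own training configuration.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=64,                    # adapter rank: higher = more capacity and memory
    lora_alpha=32,           # scaling factor applied to the adapter update
    lora_dropout=0.05,       # dropout on adapter layers for regularization
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt (model-specific)
)
```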
Maintenance & Community
- Active community with Discord and WeChat channels for support and contributions.
- Regular updates and roadmap outlining future development, including inference optimization and Chinese language support.
- Welcomes contributions via issues and pull requests.
Licensing & Compatibility
- MIT License. Permissive for commercial use and integration with closed-source projects.
Limitations & Caveats
- The project is described as experimental.
- Performance benchmarks are provided for specific models and datasets; results may vary with different configurations or databases.
- Some advanced features like DeepSpeed multi-GPU training require specific configuration adjustments.