CHESS by ShayanTalaei

LLM-powered multi-agent framework for efficient SQL synthesis

Created 2 years ago

273 stars

Top 94.6% on SourcePulse

Project Summary

Contextual Harnessing for Efficient SQL Synthesis (CHESS) addresses the long-standing challenge of translating natural language questions into SQL queries, particularly when dealing with large database catalogs and ambiguous language. It offers an LLM-based multi-agent framework designed for efficient, scalable, and accurate SQL generation, targeting researchers and engineers in the text-to-SQL domain seeking robust industrial solutions.

How It Works

CHESS employs a modular, multi-agent architecture comprising four specialized agents: Information Retriever (IR) for retrieving relevant data, Schema Selector (SS) for pruning large schemas, Candidate Generator (CG) for generating and iteratively refining queries, and Unit Tester (UT) for LLM-based validation. Together, the agents address extensive database catalogs, schema reasoning, query validity, and natural language ambiguity. The Schema Selector agent is a key differentiator: it cuts LLM token usage by a factor of 5 while improving accuracy.
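
The agent hand-off described above can be pictured as a simple sequential pipeline. The following is a minimal sketch, not the CHESS implementation: the Task structure, the function names, and the keyword-matching logic are all hypothetical stand-ins, and the real agents call an LLM where these stubs use string matching.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Task:
    """State passed from agent to agent (hypothetical structure)."""
    question: str
    schema: dict[str, list[str]]                 # table name -> column names
    keywords: list[str] = field(default_factory=list)
    candidates: list[str] = field(default_factory=list)


def information_retriever(task: Task) -> Task:
    # Stand-in for IR: keep question words that match a column name.
    words = {w.strip("?,.").lower() for w in task.question.split()}
    columns = {c for cols in task.schema.values() for c in cols}
    task.keywords = sorted(words & columns)
    return task


def schema_selector(task: Task) -> Task:
    # Stand-in for SS: drop tables with no column matching a keyword.
    task.schema = {t: cols for t, cols in task.schema.items()
                   if any(k in cols for k in task.keywords)}
    return task


def candidate_generator(task: Task) -> Task:
    # Stand-in for CG: emit one trivial query per (table, matched column).
    task.candidates = [f"SELECT {k} FROM {t}"
                       for t, cols in task.schema.items()
                       for k in task.keywords if k in cols]
    return task


def unit_tester(task: Task) -> Task:
    # Stand-in for UT: keep candidates that pass a simple structural check.
    task.candidates = [q for q in task.candidates
                       if q.startswith("SELECT") and q.split()[-1] in task.schema]
    return task


def run_pipeline(task: Task, agents: list[Callable[[Task], Task]]) -> Task:
    # Each agent receives the task state left by the previous one.
    for agent in agents:
        task = agent(task)
    return task


task = Task(
    question="What is the average salary?",
    schema={"employees": ["name", "salary"], "offices": ["city"]},
)
result = run_pipeline(task, [information_retriever, schema_selector,
                             candidate_generator, unit_tester])
print(result.schema)
print(result.candidates)
```

Because the agents share one interface in this sketch, leaving an agent out of the list is enough to run a reduced configuration, which is one plausible way to picture a modular design of this kind.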

Quick Start & Requirements

Installation takes three steps: clone the repository, create a .env file containing the required API keys (OpenAI, Google Cloud) and configuration paths, and install dependencies with pip install -r requirements.txt. A mandatory preprocessing step (sh run/run_preprocess.sh) then builds the database indexes (MinHash, LSH, vector). The main entry points are sh run/run_main_ir_cg_ut.sh and sh run/run_main_ir_ss_cg.sh, whose names follow the agent abbreviations (IR, SS, CG, UT).
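
To give a sense of what a MinHash index is for, here is a minimal pure-Python sketch of MinHash signatures over character shingles. It is an illustration only: the function names and parameters (shingles, minhash_signature, k, num_perm) are hypothetical, the seeded SHA-1 hashing is a simplification, and the repository's preprocessing script may build its indexes quite differently.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set[str]:
    # Break a string into overlapping character k-grams.
    text = text.lower()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(seed: int, token: str) -> int:
    # Seeded hash: one "permutation" per seed (a simplification).
    digest = hashlib.sha1(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(text: str, num_perm: int = 64) -> list[int]:
    # For each seed, keep the minimum hash over all shingles.
    grams = shingles(text)
    return [min(_hash(seed, g) for g in grams) for seed in range(num_perm)]


def estimated_similarity(sig_a: list[int], sig_b: list[int]) -> float:
    # The fraction of matching positions estimates Jaccard similarity.
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


query = minhash_signature("Los Angles")        # misspelled user input
close = estimated_similarity(query, minhash_signature("Los Angeles"))
far = estimated_similarity(query, minhash_signature("Chicago"))
print(close, far)
```

An LSH index would then group signatures into bands so that likely matches can be found without comparing the query against every stored value.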

Highlighted Details

Industrial-Scale Database Support: The Schema Selector agent narrows down large schemas, improving accuracy by approximately 2% and cutting LLM token usage by a factor of 5.
Privacy-Preserving Performance: Achieves state-of-the-art results among methods that use open-source models, making it suitable for industrial deployments where data cannot be sent to proprietary LLM services.
Scalability: Demonstrates 71.10% accuracy on the BIRD test set, closely matching proprietary methods while reducing LLM calls by approximately 83%.
Sub-sampled Development Set (SDS): A 10% subset of the BIRD development set is provided for ablation studies.
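
The savings above are stated in two different forms ("5x" and "approximately 83%"). The short sketch below converts between them so the figures can be compared directly. The helper names are hypothetical and the code is plain arithmetic, not part of CHESS.

```python
def fraction_saved(factor: float) -> float:
    # An N-fold reduction leaves 1/N of the original cost.
    return 1.0 - 1.0 / factor


def reduction_factor(saved: float) -> float:
    # Saving a fraction s leaves (1 - s) of the original cost.
    return 1.0 / (1.0 - saved)


token_saving = fraction_saved(5.0)      # the 5x token reduction
call_factor = reduction_factor(0.83)    # the ~83% reduction in LLM calls
print(f"5x fewer tokens = {token_saving:.0%} saved")
print(f"83% fewer calls = about {call_factor:.1f}x fewer")
```

Read this way, a 5x token reduction corresponds to an 80% saving, and an 83% reduction in calls corresponds to a factor of just under 6.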

Maintenance & Community

The provided README does not contain information regarding maintainers, community channels (e.g., Discord, Slack), or project roadmaps.

Licensing & Compatibility

The README does not specify the software license. This omission requires clarification for potential adoption, especially concerning commercial use or integration with closed-source systems.

Limitations & Caveats

The setup requires specific API keys (OpenAI, Google Cloud) and a multi-step preprocessing phase. Performance metrics are reported on the BIRD dataset, and the framework's applicability to other datasets or LLMs may require modifications as outlined in the run/langchain_utils.py file.

