CHESS  by ShayanTalaei

LLM-powered multi-agent framework for efficient SQL synthesis

Created 1 year ago
254 stars

Top 99.0% on SourcePulse

Project Summary

Contextual Harnessing for Efficient SQL Synthesis (CHESS) addresses the long-standing challenge of translating natural language questions into SQL queries, particularly when dealing with large database catalogs and ambiguous language. It offers an LLM-based multi-agent framework designed for efficient, scalable, and accurate SQL generation, targeting researchers and engineers in the text-to-SQL domain seeking robust industrial solutions.

How It Works

CHESS employs a modular, multi-agent architecture comprising four specialized agents: Information Retriever (IR) for data extraction, Schema Selector (SS) for pruning large schemas, Candidate Generator (CG) for iterative query refinement, and Unit Tester (UT) for LLM-based validation. Together, these agents address extensive database catalogs, schema reasoning, query validity, and natural language ambiguity. The Schema Selector is the key differentiator: it cuts LLM token usage fivefold while improving accuracy.
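To make the hand-off between the four agents concrete, here is a minimal sketch of a sequential agent pipeline. It is not the CHESS implementation: the PipelineState class, the function names, and the keyword-matching logic are hypothetical stand-ins, and the real agents call an LLM where this sketch uses simple string operations.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PipelineState:
    """Shared state passed from one agent to the next (hypothetical structure)."""
    question: str
    schema: Dict[str, List[str]]  # table name -> column names
    retrieved: List[str] = field(default_factory=list)
    candidates: List[str] = field(default_factory=list)
    final_sql: str = ""
    trace: List[str] = field(default_factory=list)


Agent = Callable[[PipelineState], PipelineState]


def information_retriever(state: PipelineState) -> PipelineState:
    # Stand-in for IR: keep column names that literally appear in the question.
    words = {w.strip("?,.").lower() for w in state.question.split()}
    state.retrieved = sorted(
        col for cols in state.schema.values() for col in cols if col.lower() in words
    )
    state.trace.append("IR")
    return state


def schema_selector(state: PipelineState) -> PipelineState:
    # Stand-in for SS: drop every table that holds no retrieved column.
    state.schema = {
        table: cols
        for table, cols in state.schema.items()
        if any(col in state.retrieved for col in cols)
    }
    state.trace.append("SS")
    return state


def candidate_generator(state: PipelineState) -> PipelineState:
    # Stand-in for CG: emit one candidate query per remaining table.
    state.candidates = [
        f"SELECT {', '.join(c for c in cols if c in state.retrieved)} FROM {table}"
        for table, cols in state.schema.items()
    ]
    state.trace.append("CG")
    return state


def unit_tester(state: PipelineState) -> PipelineState:
    # Stand-in for UT: accept the first candidate that looks like a SELECT.
    state.final_sql = next((c for c in state.candidates if c.startswith("SELECT")), "")
    state.trace.append("UT")
    return state


def run_pipeline(state: PipelineState, agents: List[Agent]) -> PipelineState:
    # Each agent reads the shared state, updates it, and passes it on.
    for agent in agents:
        state = agent(state)
    return state


example = run_pipeline(
    PipelineState(
        question="What is the salary of each employee?",
        schema={"employees": ["name", "salary"], "orders": ["order_id", "total"]},
    ),
    [information_retriever, schema_selector, candidate_generator, unit_tester],
)
print(example.final_sql)  # SELECT salary FROM employees
```

Because the agents share one state object and a common signature, a stage can be dropped or reordered by editing the list passed to run_pipeline, which mirrors how the project ships separate run scripts for different agent combinations.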

Quick Start & Requirements

Installation involves cloning the repository, creating a .env file with the necessary API keys (OpenAI, Google Cloud) and configuration paths, and installing dependencies with pip install -r requirements.txt. A mandatory preprocessing step (sh run/run_preprocess.sh) then generates the database indexes (MinHash, LSH, vector). To run the pipeline, use either sh run/run_main_ir_cg_ut.sh or sh run/run_main_ir_ss_ch.sh.
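Since every later step depends on the .env file, it can help to check it before running the preprocessing script. The sketch below is a hypothetical helper, not part of CHESS: it parses simple KEY=VALUE lines and reports which required keys are missing or empty. The key names in the usage comment are assumptions for illustration only; the repository's own README is the authority on which variables are required.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def parse_env_file(path: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values: Dict[str, str] = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env: Dict[str, str], required: Iterable[str]) -> List[str]:
    """Return the required keys that are absent or set to an empty value."""
    return [key for key in required if not env.get(key)]


# Usage (key names are assumed, not taken from the CHESS README):
# env = parse_env_file(".env")
# print(missing_keys(env, ["OPENAI_API_KEY", "DB_ROOT_PATH"]))
```

This parser handles only the plain KEY=VALUE form; it does not support multi-line values, export prefixes, or variable interpolation.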

Highlighted Details

  • Industrial-Scale Database Support: The Schema Selector agent efficiently narrows down large schemas, improving accuracy by approximately 2% and cutting LLM token usage fivefold.
  • Privacy-Preserving Performance: Achieves state-of-the-art results among open-source models, which makes it suitable for industrial deployments that favor open-source models for privacy reasons.
  • Scalability: Reaches 71.10% accuracy on the BIRD test set, closely matching proprietary methods while reducing LLM calls by approximately 83%.
  • Sub-sampled Development Set (SDS): A 10% subset of the BIRD dataset is provided for ablation studies.
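The figures above mix two ways of stating a saving: a multiplicative factor (5x fewer tokens) and a percentage (83% fewer LLM calls). The short sketch below converts between the two so they can be compared on one scale. The function names are hypothetical, and the conversion is plain arithmetic on the reported numbers, not an additional measurement. Note that the two figures describe different quantities (tokens versus calls) in different settings.

```python
def factor_to_percent_saved(factor: float) -> float:
    """A k-fold reduction leaves 1/k of the original, so it saves (1 - 1/k) * 100 percent."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return (1.0 - 1.0 / factor) * 100.0


def percent_saved_to_factor(percent: float) -> float:
    """Saving p percent leaves (1 - p/100) of the original, i.e. a 1 / (1 - p/100) fold reduction."""
    if not 0 <= percent < 100:
        raise ValueError("percent must be in [0, 100)")
    return 1.0 / (1.0 - percent / 100.0)


# A 5x token reduction corresponds to roughly 80% fewer tokens.
print(round(factor_to_percent_saved(5), 1))
# An 83% cut in LLM calls corresponds to a bit under a 6x reduction.
print(round(percent_saved_to_factor(83), 2))
```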

Maintenance & Community

The provided README does not contain information regarding maintainers, community channels (e.g., Discord, Slack), or project roadmaps.

Licensing & Compatibility

The README does not specify the software license. This omission requires clarification for potential adoption, especially concerning commercial use or integration with closed-source systems.

Limitations & Caveats

The setup requires specific API keys (OpenAI, Google Cloud) and a multi-step preprocessing phase. Performance metrics are reported on the BIRD dataset, and the framework's applicability to other datasets or LLMs may require modifications as outlined in the run/langchain_utils.py file.

Health Check

  • Last Commit: 6 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 30 days
