TAG-Bench by TAG-Research

Benchmark for table-augmented generation (TAG) research

created 11 months ago
747 stars

Top 47.4% on sourcepulse

Project Summary

TAG-Bench provides a benchmark and framework for Table-Augmented Generation (TAG), a paradigm for answering natural language questions over databases by unifying Large Language Models (LLMs) with database interactions. It targets researchers and practitioners in NLP and database communities, offering a standardized way to evaluate and advance methods that go beyond simple Text2SQL or RAG.

How It Works

TAG extends traditional Text2SQL and RAG by enabling more complex interactions between LLMs and databases. The TAG v1 benchmark, derived from BIRD, includes 80 queries requiring either world knowledge or semantic reasoning beyond explicit database content. This approach aims to capture a broader spectrum of database-interaction tasks, highlighting the limitations of current methods and motivating new research directions.
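
To make the pattern concrete, here is a minimal sketch of the three TAG steps (query synthesis, query execution, answer generation). This is not the repository's implementation; call_llm is a hypothetical placeholder for whatever language model server you configure.

```python
# Minimal sketch of the TAG pattern; not the repository's implementation.
# `call_llm` is a hypothetical placeholder for your configured LM server.
import sqlite3


def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your language model server")


def tag_answer(question: str, db_path: str, schema: str) -> str:
    # 1. Query synthesis: the LM translates the question into SQL,
    #    as in plain Text2SQL.
    sql = call_llm(f"Schema:\n{schema}\n\nWrite a SQLite query for: {question}")

    # 2. Query execution: run the generated SQL against the database.
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(sql).fetchall()

    # 3. Answer generation: the LM reasons over the returned rows,
    #    supplying world knowledge or semantic judgment that the
    #    database alone cannot express.
    return call_llm(f"Question: {question}\nRows: {rows}\nAnswer concisely:")
```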

Quick Start & Requirements

  • Install: create a conda environment (conda create -n tag python=3.10 -y), then run pip install -r requirements.txt and pip install -e . inside it.
  • Prerequisites: Python 3.10, conda, git, bash, and a language model server (see the LOTUS documentation for configuration). A GPU is recommended for indexing.
  • Setup: download databases (get_dbs.sh), build indexes (embed_all_dfs.sh), and generate Text2SQL prompts (get_text2sql_prompts.sh); a sanity-check snippet follows this list.
  • Links: LOTUS documentation (for LM server configuration).
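
After running the setup scripts, a quick sanity check is to open one of the downloaded SQLite databases with Python's built-in sqlite3 module. The path below is hypothetical; point it at wherever get_dbs.sh placed the files.

```python
# Sanity check after setup: open a downloaded SQLite database and list
# its tables. The path is hypothetical; adjust it to wherever
# get_dbs.sh placed the databases.
import sqlite3

db_path = "path/to/dev_databases/example_db/example_db.sqlite"  # hypothetical
with sqlite3.connect(db_path) as conn:
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()

print([name for (name,) in tables])
```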

Highlighted Details

  • Evaluates methods like hand-written TAG, Text2SQL, Text2SQL+LM, RAG, and RAG+LM.
  • Benchmark queries include match-based, comparison, ranking, and aggregation types.
  • Of the 80 queries, 40 require parametric (world) knowledge and 40 require semantic reasoning.
  • An analysis script (analyze.py) computes accuracy and latency; see the illustrative sketch below.
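
For illustration, the kind of computation such an analysis performs might look like the sketch below. The record fields are hypothetical; the README does not document analyze.py's actual input format.

```python
# Illustrative accuracy/latency computation in the spirit of analyze.py.
# The record fields are hypothetical; the script's real input format is
# not documented in the README.
from statistics import mean

results = [
    {"predicted": "A", "gold": "A", "latency_s": 2.1},
    {"predicted": "B", "gold": "C", "latency_s": 3.4},
]

accuracy = mean(r["predicted"] == r["gold"] for r in results)  # exact match
avg_latency = mean(r["latency_s"] for r in results)
print(f"accuracy={accuracy:.2f}, avg latency={avg_latency:.2f}s")
```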

Maintenance & Community

  • Project maintained by TAG-Research.
  • No community links (Discord/Slack) or roadmap are mentioned in the README.

Licensing & Compatibility

  • The README does not state a license; check the repository for a LICENSE file before reuse. The project installs as a standard Python package (requirements.txt plus an editable pip install).

Limitations & Caveats

The benchmark is an initial release (v1) and covers only a subset of query types. Reproducing results requires configuring a language model server through LOTUS, a step this README does not detail.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 27 stars in the last 90 days
