Table-Pretraining by microsoft

Research code for table pre-training via learning a neural SQL executor

created 4 years ago
296 stars

Top 90.6% on sourcepulse

Project Summary

TAPEX is a pre-training approach designed to imbue generative language models with table reasoning capabilities. It targets researchers and practitioners working with structured data, offering state-of-the-art performance on table-based question answering tasks by learning to execute SQL queries over tables.

How It Works

TAPEX trains a model to mimic the process of executing SQL queries against a table. This is achieved by synthesizing a large corpus of (SQL query, flattened table, SQL execution result) tuples. The core idea is that by learning to faithfully execute SQL, the model develops a deep understanding of table structures and gains an inductive bias for reasoning over them. This approach allows for systematic generation of diverse and high-quality pre-training data.
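
As an illustration of the synthesis step, the sketch below builds one (SQL query, flattened table, execution result) tuple, using sqlite3 to execute the query. The flattening format is an approximation of the paper's linearization, not the repository's exact implementation.

```python
import sqlite3

# Sketch of synthesizing one pre-training tuple:
# (SQL query, flattened table, SQL execution result).
header = ["year", "city"]
rows = [(1896, "athens"), (1900, "paris")]
query = "select city from t where year = 1896"

# Execute the query against an in-memory table to obtain the target output.
conn = sqlite3.connect(":memory:")
conn.execute("create table t (year integer, city text)")
conn.executemany("insert into t values (?, ?)", rows)
result = [str(r[0]) for r in conn.execute(query)]

# Linearize the table so a sequence model can consume it
# (approximates the paper's flattening; the repo's exact format may differ).
flat = "head : " + " | ".join(header)
for i, row in enumerate(rows, start=1):
    flat += f" row {i} : " + " | ".join(str(c) for c in row)

# The encoder input pairs the query with the flattened table;
# the decoder is trained to emit the execution result.
source = f"{query} {flat}"
target = ", ".join(result)
print(source)
print(target)  # athens
```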

Quick Start & Requirements

  • Install: pip install --editable ./ inside a Python 3.8 virtual environment.
  • Prerequisites: fairseq (>= 0.12.0).
  • Resources: Pre-trained models and preprocessed datasets are available for download. The pre-training corpus contains ~5 million examples.
  • Docs: Huggingface Transformers integration (see the inference sketch below).
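
A minimal inference sketch via the Huggingface integration. The microsoft/tapex-base checkpoint id and the TapexTokenizer API come from the Huggingface hub and library rather than this repository's fairseq path; treat the snippet as illustrative.

```python
import pandas as pd
from transformers import TapexTokenizer, BartForConditionalGeneration

# The pre-trained (not fine-tuned) TAPEX checkpoint executes SQL
# over a table that the tokenizer flattens internally.
tokenizer = TapexTokenizer.from_pretrained("microsoft/tapex-base")
model = BartForConditionalGeneration.from_pretrained("microsoft/tapex-base")

table = pd.DataFrame({"year": [1896, 1900], "city": ["athens", "paris"]})
query = "select city where year = 1900"

encoding = tokenizer(table=table, query=query, return_tensors="pt")
outputs = model.generate(**encoding)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))  # expected: ['paris']
```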

Highlighted Details

  • Achieves SOTA performance on WikiSQL, SQA, and WikiTableQuestions benchmarks.
  • Supports fine-tuning via Huggingface Transformers library.
  • Includes code for synthesizing custom pre-training data using SQL templates.
  • Offers pre-trained models (tapex.base, tapex.large) and fine-tuned weights for various datasets; a question-answering sketch follows this list.
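
For the fine-tuned weights, a sketch of table question answering with the microsoft/tapex-large-finetuned-wtq checkpoint from the Huggingface hub (illustrative; unlike the pre-trained model, it takes a natural-language question rather than SQL):

```python
import pandas as pd
from transformers import TapexTokenizer, BartForConditionalGeneration

# A checkpoint fine-tuned on WikiTableQuestions answers
# natural-language questions instead of executing SQL.
name = "microsoft/tapex-large-finetuned-wtq"
tokenizer = TapexTokenizer.from_pretrained(name)
model = BartForConditionalGeneration.from_pretrained(name)

table = pd.DataFrame({"year": [1896, 1900], "city": ["athens", "paris"]})
question = "in which year did paris host the games?"

encoding = tokenizer(table=table, query=question, return_tensors="pt")
outputs = model.generate(**encoding)
# Expected answer: 1900 (the decoded string may carry a leading space).
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```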

Maintenance & Community

  • The project is associated with the ICLR 2022 paper "TAPEX: Table Pre-training via Learning a Neural SQL Executor".
  • Active updates noted through 2022, including Huggingface integration.

Licensing & Compatibility

  • Code & Models: MIT License.
  • Pre-training Corpus: CC BY-SA 4.0.
  • The MIT license permits commercial use and linking with closed-source projects.

Limitations & Caveats

  • The tapex.large model is affected by a known bug related to bart.large, which may degrade performance.
  • The fairseq dependency can be challenging for beginners to set up.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 90 days
