TabLLM by clinicalml

Research paper code for few-shot tabular data classification using LLMs

created 2 years ago
316 stars

Top 86.7% on sourcepulse

View on GitHub
Project Summary

TabLLM addresses the challenge of few-shot classification for tabular data by leveraging Large Language Models (LLMs). It targets researchers and practitioners in machine learning and natural language processing who need to perform classification tasks on structured datasets with limited labeled examples. The primary benefit is enabling effective classification without extensive task-specific fine-tuning or large labeled datasets.

How It Works

TabLLM converts tabular data into textual representations so that LLMs can process and classify it. Among the serializations evaluated, the "Text" method, which encodes each row as a natural-language string combined with a task prompt, proved most effective in the paper's experiments. The system then uses the t-few codebase for parameter-efficient fine-tuning of LLMs on these serialized datasets, enabling few-shot learning.
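The row-to-text idea can be sketched as follows. The function name and the exact sentence template are illustrative, not taken from the repository (TabLLM's real templates are built with PromptSource):

```python
def serialize_row(row: dict, task_prompt: str) -> str:
    """Turn one tabular row into a natural-language string plus a task prompt.

    Illustrative sketch of "Text" serialization: each feature becomes a
    sentence of the form "The <column> is <value>.".
    """
    feature_text = " ".join(f"The {col} is {val}." for col, val in row.items())
    return f"{feature_text}\n{task_prompt}"


example = serialize_row(
    {"age": 39, "workclass": "State-gov", "education": "Bachelors"},
    "Does this person earn more than 50000 dollars per year? Yes or no?",
)
print(example)
```

The resulting string is what the LLM sees as input, so the classification task reduces to answering a natural-language question about the serialized row.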

Quick Start & Requirements

  • Environment Setup: Requires Conda with Python 3.8 for TabLLM and Python 3.7 for the t-few project.
  • Dependencies: PyTorch 1.10.1 (CUDA 11.3), Hugging Face transformers, datasets, sentencepiece, protobuf, xgboost, lightgbm, tabpfn, fsspec, urllib3, importlib-metadata, scikit-learn.
  • Data Serialization: A script create_external_datasets.py is provided to serialize nine public datasets.
  • Model Training: Requires copying modified t-few files into the t-few repository and running shell scripts like ./bin/few-shot-pretrained-100k.sh.
  • Reproducing Results: Detailed instructions and example commands are provided for reproducing specific experimental results, including output parsing.
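Put together, the setup steps above might look roughly like this. The Python versions and script names come from the bullets; the exact conda/pip commands and file paths may differ from the README, so treat this as a sketch rather than a verbatim recipe:

```shell
# Separate environments: TabLLM uses Python 3.8, t-few uses Python 3.7
conda create -n tabllm python=3.8
conda create -n tfew python=3.7

# Serialize the nine public datasets into text (run inside the TabLLM repo)
conda activate tabllm
python create_external_datasets.py

# Copy the provided t-few modifications into a checkout of t-few
# (the source/destination paths depend on where the repos live),
# then launch few-shot training from inside the t-few repo
conda activate tfew
./bin/few-shot-pretrained-100k.sh
```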

Highlighted Details

  • Achieves competitive few-shot classification performance on tabular data using LLMs.
  • "Text serialization" method demonstrated superior results compared to other serialization techniques.
  • Relies on the t-few project for model training and evaluation, with provided modifications.
  • Includes scripts to reproduce experimental results and generate summary tables.

Maintenance & Community

The project is associated with authors from institutions including MIT and is published in the PMLR proceedings. It cites the t-few and PromptSource projects as well as a NeurIPS paper, indicating a connection to established research efforts. No specific community channels (Discord/Slack) or active maintenance signals are explicitly mentioned in the README.

Licensing & Compatibility

The repository does not explicitly state a license. However, it heavily relies on the t-few project, which is typically under a permissive license (e.g., MIT). Compatibility for commercial use would depend on the licenses of all dependencies and the underlying LLMs used.

Limitations & Caveats

The code for handling private healthcare datasets and some additional experiments is not included due to privacy concerns. Users may encounter dependency issues when setting up the t-few environment, requiring careful adherence to the provided commands. Path configuration is critical and may need significant adaptation for different user setups.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 19 stars in the last 90 days
