TabLLM by clinicalml

Research paper code for few-shot tabular data classification using LLMs

created 2 years ago
316 stars

Top 86.7% on sourcepulse

View on GitHub
Project Summary

TabLLM addresses the challenge of few-shot classification for tabular data by leveraging Large Language Models (LLMs). It targets researchers and practitioners in machine learning and natural language processing who need to perform classification tasks on structured datasets with limited labeled examples. The primary benefit is enabling effective classification without extensive task-specific fine-tuning or large labeled datasets.

How It Works

TabLLM converts tabular data into textual representations so that LLMs can process and classify it. Among the serializations evaluated, the "Text" method, which encodes each row as a natural-language string combined with a task prompt, proved most effective in the paper's experiments. The system then uses the t-few codebase for parameter-efficient fine-tuning of LLMs on these serialized datasets, enabling few-shot learning.
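The row-to-text idea can be sketched as follows. The function name and the exact sentence template are illustrative, not taken from the repository (TabLLM's real templates are built with PromptSource):

```python
def serialize_row(row: dict, task_prompt: str) -> str:
    """Turn one tabular row into a natural-language string plus a task prompt.

    Illustrative sketch of "Text" serialization: each feature becomes a
    sentence of the form "The <column> is <value>.".
    """
    feature_text = " ".join(f"The {col} is {val}." for col, val in row.items())
    return f"{feature_text}\n{task_prompt}"


example = serialize_row(
    {"age": 39, "workclass": "State-gov", "education": "Bachelors"},
    "Does this person earn more than 50000 dollars per year? Yes or no?",
)
print(example)
```

The resulting string is what the LLM sees as input, so the classification task reduces to answering a natural-language question about the serialized row.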

Quick Start & Requirements

  • Environment Setup: Requires Conda with Python 3.8 for TabLLM and Python 3.7 for the t-few project.
  • Dependencies: PyTorch 1.10.1 (CUDA 11.3), Hugging Face transformers, datasets, sentencepiece, protobuf, xgboost, lightgbm, tabpfn, fsspec, urllib3, importlib-metadata, scikit-learn.
  • Data Serialization: A script create_external_datasets.py is provided to serialize nine public datasets.
  • Model Training: Requires copying modified t-few files into the t-few repository and running shell scripts like ./bin/few-shot-pretrained-100k.sh.
  • Reproducing Results: Detailed instructions and example commands are provided for reproducing specific experimental results, including output parsing.
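Put together, the setup steps above might look roughly like this. The Python versions and script names come from the bullets; the exact conda/pip commands and file paths may differ from the README, so treat this as a sketch rather than a verbatim recipe:

```shell
# Separate environments: TabLLM uses Python 3.8, t-few uses Python 3.7
conda create -n tabllm python=3.8
conda create -n tfew python=3.7

# Serialize the nine public datasets into text (run inside the TabLLM repo)
conda activate tabllm
python create_external_datasets.py

# Copy the provided t-few modifications into a checkout of t-few
# (the source/destination paths depend on where the repos live),
# then launch few-shot training from inside the t-few repo
conda activate tfew
./bin/few-shot-pretrained-100k.sh
```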

Highlighted Details

  • Achieves competitive few-shot classification performance on tabular data using LLMs.
  • "Text serialization" method demonstrated superior results compared to other serialization techniques.
  • Relies on the t-few project for model training and evaluation, with provided modifications.
  • Includes scripts to reproduce experimental results and generate summary tables.

Maintenance & Community

The project is associated with authors from institutions including MIT and is published in the PMLR proceedings. It cites the t-few and PromptSource projects as well as a NeurIPS paper, indicating a connection to established research efforts. No specific community channels (Discord/Slack) or active maintenance signals are explicitly mentioned in the README.

Licensing & Compatibility

The repository does not explicitly state a license. However, it heavily relies on the t-few project, which is typically under a permissive license (e.g., MIT). Compatibility for commercial use would depend on the licenses of all dependencies and the underlying LLMs used.

Limitations & Caveats

The code for handling private healthcare datasets and some additional experiments is not included due to privacy concerns. Users may encounter dependency issues when setting up the t-few environment, requiring careful adherence to the provided commands. Path configuration is critical and may need significant adaptation for different user setups.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 19 stars in the last 90 days
