be_great by tabularis-ai

Framework for synthetic tabular data generation (research paper)

Created 3 years ago

339 stars

Top 81.4% on SourcePulse

Project Summary

GReaT is a Python framework for synthesizing realistic tabular data with pretrained Transformer language models. It is aimed at researchers and data scientists who need synthetic datasets for privacy, augmentation, or testing, and it offers a user-friendly API for both data generation and imputation.

How It Works

GReaT fine-tunes a pretrained Transformer language model, such as a GPT-2 variant, to learn the underlying distribution of a table. Each row is treated as a text sequence: every categorical or numerical feature is written out as a short "column is value" clause, so the row can be fed to the language model as ordinary text. Because the model is trained on whole rows, it can capture complex dependencies between columns and generate novel, realistic rows.
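The row-to-text step described above can be sketched in a few lines. This is a minimal illustration, not the library's own code: the helper names `encode_row` and `decode_row` are hypothetical, and the parser assumes that neither column names nor values contain ", " or " is ".

```python
import random

def encode_row(row, rng=None):
    """Serialize one table row as 'col is value, col is value, ...'.

    If rng is given, the column order is shuffled, so a model trained on
    such text does not depend on one fixed column order.
    """
    items = list(row.items())
    if rng is not None:
        rng.shuffle(items)
    return ", ".join(f"{col} is {val}" for col, val in items)

def decode_row(text):
    """Parse serialized text back into a dict (all values come back as strings)."""
    row = {}
    for clause in text.split(", "):
        col, sep, val = clause.partition(" is ")
        if sep:  # skip malformed clauses that lack the ' is ' separator
            row[col] = val
    return row

# Hypothetical example row, for illustration only.
row = {"age": 39, "education": "Bachelors", "income": "<=50K"}
encoded = encode_row(row)
shuffled = encode_row(row, rng=random.Random(0))
```

Decoding either `encoded` or `shuffled` yields the same dictionary, which is why the clause order can vary freely during training.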

Quick Start & Requirements

Install via pip: pip install be-great
Requires Python >= 3.9.
Supports fp16=True for faster training on compatible hardware.
Example usage and imputation code are provided in the README.
Publication details are available via the provided citation link.
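A typical training and sampling run might look like the sketch below. It follows the pattern of the project's README as best recalled (`GReaT(...)`, `fit`, `sample`), so check the argument names against the installed version. The wrapper function `train_and_sample` and its defaults are assumptions made for illustration, and the import is done lazily so the module loads even where be-great is not installed.

```python
def train_and_sample(data, n_samples=100, fp16=False):
    """Fine-tune a small language model on a pandas DataFrame and sample new rows.

    data      : pandas.DataFrame holding the real table
    n_samples : number of synthetic rows to generate
    fp16      : set to True only on hardware that supports half precision
    """
    # Lazy import: requires `pip install be-great`.
    from be_great import GReaT

    # 'distilgpt2' is a small GPT-2 variant; batch_size and epochs are
    # example values and should be tuned for the dataset at hand.
    model = GReaT(llm="distilgpt2", batch_size=32, epochs=50, fp16=fp16)
    model.fit(data)
    return model.sample(n_samples=n_samples)
```

Calling `train_and_sample(df)` on a DataFrame `df` downloads the pretrained weights on first use and can take a long time without a GPU.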

Highlighted Details

Novel approach for tabular data synthesis using LLMs.
Supports data imputation for filling missing values.
Includes methods for saving and loading model checkpoints, with S3 support.
Based on the HuggingFace Transformers library.
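To make the imputation idea concrete, one way to frame it is as text completion: the known cells of a row become a prompt, and the model is asked to continue after the missing column's name. The sketch below only builds such a prompt. The function name `imputation_prompt` is hypothetical, and this is an illustration of the idea, not necessarily how the library implements imputation internally.

```python
def imputation_prompt(row, target):
    """Build a completion prompt for one missing cell.

    Known cells are written as 'col is value' clauses; the prompt ends with
    '<target> is' so a language model would generate the missing value next.
    Cells whose value is None are treated as missing and left out.
    """
    known = [
        f"{col} is {val}"
        for col, val in row.items()
        if val is not None and col != target
    ]
    return ", ".join(known + [f"{target} is"])

# Hypothetical row with a missing 'education' value.
prompt = imputation_prompt(
    {"age": 39, "education": None, "income": "<=50K"}, "education"
)
```

Because training rows can be serialized in any column order, a prompt that places the missing column last is still a familiar input for the model.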

Maintenance & Community

The project is associated with Vadim Borisov, Kathrin Sessler, Tobias Leemann, Martin Pawelczyk, and Gjergji Kasneci, whose paper on the method was published at the Eleventh International Conference on Learning Representations (ICLR 2023).

Licensing & Compatibility

The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Training performance and the quality of the synthetic data depend on the chosen pretrained language model and on the complexity of the input table. Hardware requirements for training larger models are not detailed.

