be_great  by tabularis-ai

Framework for synthetic tabular data generation (research paper)

created 2 years ago
317 stars

Top 86.5% on sourcepulse

Project Summary

GReaT is a Python framework for synthesizing realistic tabular data with pretrained Transformer language models. It targets researchers and data scientists who need synthetic datasets for privacy, augmentation, or testing, and it exposes a small API for both data generation and imputation of missing values.

How It Works

GReaT fine-tunes pretrained Transformer language models, such as GPT-2 variants, to learn the underlying distribution of tabular data. It treats each row as a text sequence, encoding both categorical and numerical features as text that a language model can consume. This lets the model capture complex dependencies between columns and generate novel, realistic rows.
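
The row-as-sequence idea can be illustrated with a small, library-independent sketch. It assumes the "column is value" sentence format with a shuffled column order, as described in the GReaT paper. The helper names encode_row and decode_row are hypothetical and are not part of the be_great API.

```python
import random


def encode_row(row, rng):
    """Serialize one table row as text: 'col is value, col is value, ...'."""
    items = list(row.items())
    # Shuffle the column order so the model does not learn a fixed ordering.
    rng.shuffle(items)
    return ", ".join(f"{col} is {val}" for col, val in items)


def decode_row(text, columns):
    """Parse text back into a dict; malformed parts and unknown columns are skipped."""
    row = {}
    for part in text.split(", "):
        col, sep, val = part.partition(" is ")
        if sep and col in columns and col not in row:
            row[col] = val  # values come back as strings
    return row


row = {"age": 39, "education": "Bachelors", "income": "<=50K"}
text = encode_row(row, random.Random(0))
restored = decode_row(text, set(row))
```

This toy parser would break on values that themselves contain ", " or " is ", so a real implementation needs more careful handling and type conversion when mapping generated text back to columns.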

Quick Start & Requirements

  • Install via pip: pip install be-great
  • Requires Python >= 3.9.
  • Supports fp16=True for faster training on compatible hardware.
  • Example usage and imputation code are provided in the README.
  • Publication details are available via the provided citation link.
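
A minimal training-and-sampling sketch, based on the usage shown in the project README. The wrapper function train_and_sample is hypothetical, the hyperparameter values are illustrative, and the constructor arguments should be checked against the installed be-great version.

```python
def train_and_sample(data, n_samples=100):
    """Fine-tune GReaT on a pandas DataFrame and return synthetic rows.

    Hypothetical wrapper for illustration; `data` is assumed to be a
    pandas DataFrame with named columns.
    """
    # Imported lazily so this sketch can be loaded without be-great installed.
    from be_great import GReaT  # pip install be-great

    # fp16=True assumes hardware with half-precision support (typically a CUDA GPU).
    model = GReaT(llm="distilgpt2", batch_size=32, epochs=50, fp16=True)
    model.fit(data)
    return model.sample(n_samples=n_samples)
```

Calling train_and_sample(df) on a small DataFrame is a reasonable first test. On a CPU-only machine, set fp16=False in the constructor.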

Highlighted Details

  • Novel approach for tabular data synthesis using LLMs.
  • Supports data imputation for filling missing values.
  • Includes methods for saving and loading model checkpoints, with S3 support.
  • Based on the HuggingFace Transformers library.

Maintenance & Community

The project is associated with Vadim Borisov, Kathrin Sessler, Tobias Leemann, Martin Pawelczyk, and Gjergji Kasneci. The accompanying paper was published at the Eleventh International Conference on Learning Representations (ICLR 2023).

Licensing & Compatibility

The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Training performance and the quality of the synthetic data depend on the chosen pretrained language model and on the complexity of the input table. Hardware requirements for training larger models are not documented.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 11 stars in the last 90 days
