Framework for synthetic tabular data generation (research paper)
GReaT is a Python framework for synthesizing realistic tabular data using pretrained Transformer language models. It is designed for researchers and data scientists needing to generate synthetic datasets for privacy, augmentation, or testing purposes, offering a user-friendly API for data generation and imputation.
How It Works
GReaT leverages pretrained Transformer models, like GPT-2 variants, to learn the underlying distribution of tabular data. It treats each row as a sequence, encoding categorical and numerical features into a format suitable for language models. This approach allows for capturing complex inter-column dependencies and generating novel, realistic data samples.
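The row-as-sequence idea can be illustrated with a minimal sketch. It is not GReaT's actual implementation: the helper names `encode_row` and `decode_row`, the "column is value" template, and the optional shuffling of feature order are assumptions chosen for illustration. The sketch only shows how a table row might be turned into text a language model can train on, and how generated text might be parsed back into a row.

```python
import random


def encode_row(row, rng=None):
    """Serialize one table row (a dict) as a comma-separated sentence.

    Hypothetical helper: the "<column> is <value>" template is an assumption.
    If rng is given, the feature order is shuffled so that no fixed
    column order is baked into the text.
    """
    items = list(row.items())
    if rng is not None:
        rng.shuffle(items)
    return ", ".join(f"{name} is {value}" for name, value in items)


def decode_row(text):
    """Parse a sentence produced by encode_row back into a dict of strings."""
    row = {}
    for part in text.split(", "):
        name, sep, value = part.partition(" is ")
        if sep:  # skip fragments that do not match the template
            row[name] = value
    return row


row = {"age": 39, "occupation": "teacher", "income": 52000.5}
sentence = encode_row(row)
shuffled = encode_row(row, rng=random.Random(0))
recovered = decode_row(shuffled)
```

Decoding returns every value as a string, so a real pipeline would still need to cast columns back to their original types and discard malformed generations.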
Quick Start & Requirements
Install: pip install be-great
Set fp16=True for faster training on compatible hardware.
Highlighted Details
Maintenance & Community
The project is associated with Vadim Borisov, Kathrin Sessler, Tobias Leemann, Martin Pawelczyk, and Gjergji Kasneci, and accompanies a paper published at the Eleventh International Conference on Learning Representations (ICLR 2023).
Licensing & Compatibility
The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The framework's performance and the quality of the synthetic data depend on the chosen pretrained language model and on the complexity of the input tabular data. Hardware requirements for training larger models are not detailed.