Tabular data generation via GANs, diffusion, and LLMs
This library provides tools for generating synthetic tabular data with several families of generative models: GANs, TimeGANs, diffusion models, and LLMs. It is aimed at researchers and data scientists who need high-fidelity synthetic data to improve dataset quality and support machine learning workflows.
How It Works
The library offers multiple samplers: OriginalGenerator, GANGenerator (based on CTGAN), ForestDiffusionGenerator (based on Forest Diffusion), and LLMGenerator (based on the GReaT framework). These models generate synthetic data that mimics the statistical properties of the original dataset. The generate_data_pipe method handles data preprocessing, generation, and optional adversarial filtering for quality assurance.
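The summary does not say how the adversarial filtering step is implemented, so the following is only a conceptual sketch in plain Python, not the library's code. It assumes the filter is a discriminator trained to tell training rows from test rows, and that synthetic rows are kept when the discriminator rates them as test-like. The names sigmoid, train_discriminator, and adversarial_filter are hypothetical, and the tiny logistic regression stands in for whatever model the library actually uses.

```python
import math


def sigmoid(z):
    # Clamp the input so math.exp cannot overflow on extreme values.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_discriminator(train_rows, test_rows, epochs=200, lr=0.1):
    """Fit a logistic regression with SGD: label 0 = train row, 1 = test row."""
    weights = [0.0] * len(train_rows[0])
    bias = 0.0
    labeled = [(row, 0.0) for row in train_rows] + [(row, 1.0) for row in test_rows]
    for _ in range(epochs):
        for row, label in labeled:
            pred = sigmoid(bias + sum(w * x for w, x in zip(weights, row)))
            error = pred - label
            weights = [w - lr * error * x for w, x in zip(weights, row)]
            bias -= lr * error
    return weights, bias


def adversarial_filter(candidates, model, keep):
    """Keep the `keep` candidates the discriminator rates as most test-like."""
    weights, bias = model

    def test_likeness(row):
        return sigmoid(bias + sum(w * x for w, x in zip(weights, row)))

    return sorted(candidates, key=test_likeness, reverse=True)[:keep]


# Illustrative single-feature data (made up for this sketch).
train_rows = [[0.0], [0.2], [-0.1], [0.3]]
test_rows = [[3.0], [2.8], [3.1], [3.3]]
synthetic = [[0.1], [2.9], [3.2], [-0.5]]

model = train_discriminator(train_rows, test_rows)
kept = adversarial_filter(synthetic, model, keep=2)
print(kept)
```

Because the filter discards candidates, a step of this kind naturally returns fewer rows than were generated, which is consistent with the caveat about output size noted under Limitations & Caveats.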
Quick Start & Requirements
pip install tabgan
Highlighted Details
compare_dataframes function for evaluating generated data quality.
Maintenance & Community
The project is associated with an arXiv article and a Medium post. Further details on community or active maintenance are not explicitly stated in the README.
Licensing & Compatibility
The README does not explicitly state a license. However, the project cites research papers, some of which may have their own licenses. Users should verify licensing for commercial or closed-source use.
Limitations & Caveats
The README notes that the output may contain fewer rows than requested because of post-processing and adversarial filtering. It also notes that when integer-valued columns are required, rounding must be performed outside the library. The project appears to be research-oriented, and its stability for production environments is not detailed.
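Both caveats can be handled in a small post-processing step on the caller's side. The sketch below is an illustration under stated assumptions, not part of the library: it represents rows as plain dictionaries rather than DataFrames, and the helper round_integer_columns, the column names, and the sample values are all hypothetical.

```python
def round_integer_columns(rows, integer_columns):
    """Return new rows with the named columns rounded to the nearest int.

    Note: Python's built-in round() rounds exact halves to the nearest
    even integer (for example, 2.5 becomes 2).
    """
    integer_columns = set(integer_columns)
    return [
        {
            name: round(value) if name in integer_columns else value
            for name, value in row.items()
        }
        for row in rows
    ]


# Hypothetical generator output: fewer rows than requested, float-valued ages.
requested_rows = 3
synthetic_rows = [
    {"age": 34.7, "income": 51234.56},
    {"age": 29.2, "income": 48000.10},
]

cleaned = round_integer_columns(synthetic_rows, ["age"])

# Check the row count instead of assuming the request was met in full.
shortfall = requested_rows - len(cleaned)
print(cleaned, shortfall)
```

With pandas DataFrames, the same rounding is typically written as df[cols].round().astype(int). If the shortfall matters, one option is to request more rows than needed and truncate afterwards.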