Tabular-data-generation by Diyago

Tabular data generation via GANs, diffusion, and LLMs

Created 5 years ago

562 stars

Top 57.1% on SourcePulse

1 Expert Loves This Project

Jeff Hammerbacher

Cofounder of Cloudera

Project Summary

This library provides tools for generating synthetic tabular data with several kinds of generative models: GANs, TimeGANs, diffusion models, and LLMs. It is aimed at researchers and data scientists who need high-fidelity synthetic data to improve dataset quality and support machine learning workflows.

How It Works

The library offers multiple samplers: OriginalGenerator, GANGenerator (based on CTGAN), ForestDiffusionGenerator (based on Forest Diffusion), and LLMGenerator (based on the GReaT framework). These models can generate synthetic data that mimics the statistical properties of the original dataset. The generate_data_pipe method handles data preprocessing, generation, and optional adversarial filtering for quality assurance.
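The adversarial filtering idea mentioned above can be illustrated with a small, self-contained sketch. It is not the library's implementation: the function name adversarial_filter, the keep_fraction parameter, and the choice of a hand-written logistic-regression discriminator are all assumptions made for illustration. The sketch trains a discriminator to tell generated rows from reference rows, then keeps only the generated rows it finds hardest to distinguish from the reference data.

```python
import numpy as np


def adversarial_filter(generated, reference, keep_fraction=0.5, epochs=500, lr=0.1):
    """Keep the generated rows that look most like `reference` to a discriminator.

    Hypothetical helper for illustration; not part of the tabgan API.
    """
    # Label generated rows 0 and reference rows 1, then standardize the features.
    X = np.vstack([generated, reference])
    y = np.concatenate([np.zeros(len(generated)), np.ones(len(reference))])
    mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-9
    Xs = (X - mu) / sigma

    # Fit a logistic-regression discriminator with plain gradient descent.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(Xs @ w + b)))
        w -= lr * (Xs.T @ (p - y)) / len(y)
        b -= lr * (p - y).mean()

    # Score each generated row: a higher score means "more reference-like".
    scores = 1.0 / (1.0 + np.exp(-(((generated - mu) / sigma) @ w + b)))
    n_keep = int(len(generated) * keep_fraction)
    keep_idx = np.sort(np.argsort(scores)[::-1][:n_keep])
    return generated[keep_idx]


# Toy data: half of the generated rows resemble the reference, half do not.
rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(200, 2))
generated = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 2)),  # plausible rows
    rng.normal(3.0, 1.0, size=(100, 2)),  # rows far from the reference data
])
kept = adversarial_filter(generated, reference, keep_fraction=0.5)
print(len(generated), "->", len(kept))
```

Because filtering discards rows, the result is smaller than the generated set, which is one reason a pipeline of this kind can return fewer rows than requested.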

Quick Start & Requirements

Install via pip: pip install tabgan
Requires pandas and numpy.
Supports continuous and discrete (categorical) columns. Numerical columns are treated as floats, so columns that must hold integers need to be rounded after generation.
Example usage and detailed parameter explanations are available in the README.
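The post-generation rounding step can be done with plain pandas. The sketch below assumes the synthetic output is a pandas DataFrame; the helper name restore_integer_columns and the sample columns are hypothetical and not part of the library.

```python
import pandas as pd


def restore_integer_columns(synthetic: pd.DataFrame, int_columns: list) -> pd.DataFrame:
    """Round the listed float columns and cast them to integers.

    Hypothetical helper for illustration; not part of the tabgan API.
    """
    out = synthetic.copy()
    for col in int_columns:
        # Round to the nearest whole number, then cast to an integer dtype.
        out[col] = out[col].round().astype("int64")
    return out


# Made-up stand-in for generated data in which "age" should be an integer.
synthetic = pd.DataFrame({
    "age": [34.2, 51.7, 28.4],
    "income": [1200.4, 980.9, 1500.0],
})
fixed = restore_integer_columns(synthetic, ["age"])
print(fixed.dtypes)
```

The cast fails on missing values, so any NaN entries would need to be handled before calling a helper like this.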

Highlighted Details

Supports GANs, TimeGANs, Diffusion, and LLM-based generation.
Includes post-processing and adversarial filtering for data quality.
Offers utilities for time-series data generation by extracting date components.
Provides a compare_dataframes function for evaluating generated data quality.
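The date-component idea from the list above can be sketched in pandas as follows. This is a hand-written illustration rather than the library's own utility, whose name and signature are not given here; add_date_parts and the sample columns are hypothetical. Splitting a date into numeric parts lets a tabular generator treat them as ordinary columns.

```python
import pandas as pd


def add_date_parts(df: pd.DataFrame, date_col: str) -> pd.DataFrame:
    """Replace a date column with numeric year, month, and day columns.

    Hypothetical helper for illustration; not part of the tabgan API.
    """
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    out["year"] = dates.dt.year
    out["month"] = dates.dt.month
    out["day"] = dates.dt.day
    # Drop the original column so only numeric features remain.
    return out.drop(columns=[date_col])


# Made-up sample data.
df = pd.DataFrame({"date": ["2021-03-15", "2022-11-02"], "sales": [10, 20]})
parts = add_date_parts(df, "date")
print(parts)
```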

Maintenance & Community

The project is associated with an arXiv article and a Medium post. Further details on community or active maintenance are not explicitly stated in the README.

Licensing & Compatibility

The README does not explicitly state a license. However, the project cites research papers, some of which may have their own licenses. Users should verify licensing for commercial or closed-source use.

Limitations & Caveats

The README notes that the output may contain fewer rows than requested, because post-processing and adversarial filtering discard some generated rows. It also notes that the library does not round values, so columns that must hold integers have to be rounded by the user after generation. The project appears to be research-oriented, and the README does not discuss its stability in production environments.
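One way to cope with the smaller-than-requested output is to call the generator repeatedly until enough rows have accumulated. The sketch below assumes a hypothetical zero-argument callable, generate, that wraps whatever generation call is in use and returns a pandas DataFrame; generate_at_least and max_rounds are likewise illustrative names, not library features.

```python
from typing import Callable

import pandas as pd


def generate_at_least(
    generate: Callable[[], pd.DataFrame], n_rows: int, max_rounds: int = 10
) -> pd.DataFrame:
    """Call `generate` until at least `n_rows` rows exist, then trim to `n_rows`.

    Hypothetical helper for illustration; not part of the tabgan API.
    """
    batches, total = [], 0
    for _ in range(max_rounds):
        batch = generate()
        batches.append(batch)
        total += len(batch)
        if total >= n_rows:
            break
    if total < n_rows:
        raise RuntimeError(f"only {total} rows after {max_rounds} rounds")
    return pd.concat(batches, ignore_index=True).head(n_rows)


# Stand-in generator that always returns 3 rows, as if filtering removed the rest.
def fake_generate() -> pd.DataFrame:
    return pd.DataFrame({"x": [1.0, 2.0, 3.0]})


result = generate_at_least(fake_generate, n_rows=7)
print(len(result))
```

Repeated runs may produce near-duplicate rows, so it can be worth deduplicating the combined output before use.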

Health Check

Last Commit

6 months ago

Responsiveness

1 week

Star History

1 star in the last 30 days