Tabular-data-generation by Diyago

Tabular data generation via GANs, diffusion, and LLMs

created 5 years ago
557 stars

Top 58.4% on sourcepulse

Project Summary

This library generates synthetic tabular data with several families of generative models: GANs, TimeGANs, diffusion models, and LLMs. It is aimed at researchers and data scientists who want high-fidelity synthetic data to improve dataset quality and support machine learning workflows.

How It Works

The library offers multiple samplers: OriginalGenerator, GANGenerator (based on CTGAN), ForestDiffusionGenerator (based on Forest Diffusion), and LLMGenerator (based on the GReaT framework). These models can generate synthetic data that mimics the statistical properties of the original dataset. The generate_data_pipe method handles data preprocessing, generation, and optional adversarial filtering for quality assurance.
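The stages described above (generation, post-processing, filtering) can be pictured with a small, dependency-free sketch. This is not tabgan's implementation or API. Every function name below (generate, postprocess, similarity_filter, pipeline) is hypothetical. The generator simply resamples rows with noise, and the filter ranks rows by distance to the reference column means as a stand-in for a trained adversarial classifier.

```python
import random


def generate(rows, n, rng):
    """Stand-in generator: resample rows with replacement and add Gaussian noise."""
    out = []
    for _ in range(n):
        base = rng.choice(rows)
        out.append({k: v + rng.gauss(0, 1.0) for k, v in base.items()})
    return out


def postprocess(rows, reference):
    """Drop generated rows that fall outside the value ranges seen in `reference`."""
    cols = list(reference[0].keys())
    lo = {c: min(r[c] for r in reference) for c in cols}
    hi = {c: max(r[c] for r in reference) for c in cols}
    return [r for r in rows if all(lo[c] <= r[c] <= hi[c] for c in cols)]


def similarity_filter(rows, reference, keep_fraction):
    """Keep the rows closest to the reference column means.

    Assumption: a crude substitute for adversarial filtering, which would
    instead train a classifier to tell generated rows from reference rows.
    """
    cols = list(reference[0].keys())
    mean = {c: sum(r[c] for r in reference) / len(reference) for c in cols}

    def squared_distance(row):
        return sum((row[c] - mean[c]) ** 2 for c in cols)

    ranked = sorted(rows, key=squared_distance)
    return ranked[: int(len(ranked) * keep_fraction)]


def pipeline(train, reference, n, keep_fraction=0.8, seed=0):
    """Run the three stages in order with a seeded random generator."""
    rng = random.Random(seed)
    generated = generate(train, n, rng)
    in_range = postprocess(generated, reference)
    return similarity_filter(in_range, reference, keep_fraction)


# Toy data: two numeric columns, with the reference set covering a narrower range.
train = [{"a": float(i), "b": float(2 * i)} for i in range(20)]
reference = [{"a": float(i), "b": float(2 * i)} for i in range(5, 15)]

synthetic = pipeline(train, reference, n=100)
print(f"requested 100 rows, kept {len(synthetic)}")
```

Both the range check and the filter discard rows, so the result is shorter than the 100 rows requested.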

Quick Start & Requirements

  • Install via pip: pip install tabgan
  • Requires pandas and numpy.
  • Supports continuous and discrete (categorical) columns. Numerical columns are treated as floats, so columns that must hold integers need to be rounded after generation.
  • Example usage and detailed parameter explanations are available in the README.
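The rounding step mentioned in the list above could look like the sketch below. It uses plain Python dictionaries to stay dependency-free. The helper name round_integer_columns and the sample column names are assumptions made for illustration, not part of tabgan.

```python
def round_integer_columns(rows, integer_columns):
    """Return copies of `rows` with the listed columns rounded to int.

    Python's round() rounds exact halves to the nearest even integer.
    """
    rounded = []
    for row in rows:
        fixed = dict(row)  # copy so the generated rows are left untouched
        for col in integer_columns:
            fixed[col] = int(round(fixed[col]))
        rounded.append(fixed)
    return rounded


# Hypothetical generated rows in which "age" should be an integer.
generated = [
    {"age": 34.7, "income": 51234.56},
    {"age": 21.2, "income": 18000.10},
]

cleaned = round_integer_columns(generated, ["age"])
print(cleaned)
```

With a pandas DataFrame the same idea is usually written as df["age"] = df["age"].round().astype(int), assuming the column has no missing values.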

Highlighted Details

  • Supports GANs, TimeGANs, Diffusion, and LLM-based generation.
  • Includes post-processing and adversarial filtering for data quality.
  • Offers utilities for time-series data generation by extracting date components.
  • Provides a compare_dataframes function for evaluating generated data quality.
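The date-component idea from the list above can be illustrated with a minimal sketch: split a date column into year, month, and day before generation, then rebuild it afterwards. The helper names split_date and collect_date are hypothetical, and dropping impossible dates is a design choice of this sketch, not documented library behavior.

```python
from datetime import date


def split_date(rows, column):
    """Replace a date column with separate numeric year/month/day columns."""
    out = []
    for row in rows:
        d = row[column]
        new = {k: v for k, v in row.items() if k != column}
        new[f"{column}_year"] = d.year
        new[f"{column}_month"] = d.month
        new[f"{column}_day"] = d.day
        out.append(new)
    return out


def collect_date(rows, column):
    """Rebuild the date column; rows with impossible dates (e.g. Feb 30) are dropped."""
    parts = (f"{column}_year", f"{column}_month", f"{column}_day")
    out = []
    for row in rows:
        try:
            # Generated values may be floats, so round them before building a date.
            rebuilt = date(*(int(round(row[p])) for p in parts))
        except ValueError:
            continue
        new = {k: v for k, v in row.items() if k not in parts}
        new[column] = rebuilt
        out.append(new)
    return out


original = [
    {"date": date(2024, 1, 15), "sales": 10.0},
    {"date": date(2024, 2, 29), "sales": 12.5},
]

numeric = split_date(original, "date")
restored = collect_date(numeric, "date")

# A generated row can describe a date that does not exist and is then discarded.
invalid = [{"date_year": 2023.0, "date_month": 2.0, "date_day": 30.0, "sales": 1.0}]
print(restored == original, collect_date(invalid, "date"))
```

Splitting the date gives a generator ordinary numeric columns to model, at the cost of occasionally producing combinations that are not valid calendar dates.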

Maintenance & Community

The project is associated with an arXiv article and a Medium post. Further details on community or active maintenance are not explicitly stated in the README.

Licensing & Compatibility

The README does not explicitly state a license. However, the project cites research papers, some of which may have their own licenses. Users should verify licensing for commercial or closed-source use.

Limitations & Caveats

According to the README, the output may contain fewer rows than requested because post-processing and adversarial filtering discard some generated rows. Columns that must hold integers have to be rounded outside the library. The project appears research-oriented, and the README does not address its stability in production environments.
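When a fixed number of rows is needed, one way to work around the shrinking output is to keep requesting the shortfall until enough rows have accumulated. The sketch below assumes a hypothetical make_batch callable that wraps whatever generator is in use. The lossy_batch stand-in drops every third row deterministically to imitate filtering losses. Neither name comes from tabgan.

```python
def generate_at_least(make_batch, n_rows, max_rounds=10):
    """Call `make_batch(k)` repeatedly until `n_rows` rows are collected.

    `make_batch` may return fewer than `k` rows. The loop gives up after
    `max_rounds` attempts so that a generator returning nothing cannot hang it.
    """
    collected = []
    for _ in range(max_rounds):
        shortfall = n_rows - len(collected)
        if shortfall <= 0:
            break
        collected.extend(make_batch(shortfall))
    return collected[:n_rows]


def lossy_batch(k):
    """Stand-in generator: builds `k` candidate rows, then drops every third one."""
    candidates = [{"x": float(i)} for i in range(k)]
    return [row for i, row in enumerate(candidates) if i % 3 != 2]


rows = generate_at_least(lossy_batch, 50)
print(len(rows))
```

Rows produced in later rounds pass through the same filtering as the first batch, so this changes only the row count, not the quality criteria.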

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 11 stars in the last 90 days
