Tabular-data-generation by Diyago

Tabular data generation via GANs, diffusion, and LLMs

Created 5 years ago
562 stars

Top 57.1% on SourcePulse

Project Summary

This library generates synthetic tabular data with several families of generative models: GANs, TimeGANs, diffusion models, and LLMs. It targets researchers and data scientists who want high-fidelity synthetic rows to augment or improve a dataset before training machine learning models.

How It Works

The library offers multiple samplers: OriginalGenerator, GANGenerator (based on CTGAN), ForestDiffusionGenerator (based on Forest Diffusion), and LLMGenerator (based on the GReaT framework). These models can generate synthetic data that mimics the statistical properties of the original dataset. The generate_data_pipe method handles data preprocessing, generation, and optional adversarial filtering for quality assurance.
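The generate-then-filter flow described above can be illustrated with a small, library-free sketch. This is not tabgan's implementation: the functions `generate` and `filter_to_reference` are hypothetical names, the "generator" simply resamples and jitters training rows, and a per-column range check against the test set stands in for the adversarial filtering step. It is only meant to show why a filtering stage can return fewer rows than were requested.

```python
import random

def generate(train_rows, n, rng, noise=0.05):
    """Toy generator (assumption): resample training rows and jitter each value."""
    out = []
    for _ in range(n):
        base = rng.choice(train_rows)
        out.append({k: v * (1 + rng.uniform(-noise, noise)) for k, v in base.items()})
    return out

def column_bounds(rows):
    """Return {column: (min, max)} over a list of row dicts."""
    cols = rows[0].keys()
    return {c: (min(r[c] for r in rows), max(r[c] for r in rows)) for c in cols}

def filter_to_reference(rows, reference_rows):
    """Stand-in for quality filtering: keep rows inside the reference value ranges."""
    bounds = column_bounds(reference_rows)
    return [
        r for r in rows
        if all(bounds[c][0] <= r[c] <= bounds[c][1] for c in bounds)
    ]

rng = random.Random(0)
train = [{"age": rng.uniform(18, 80), "income": rng.uniform(20_000, 120_000)}
         for _ in range(200)]
test = [{"age": rng.uniform(25, 60), "income": rng.uniform(30_000, 90_000)}
        for _ in range(100)]

requested = 500
candidates = generate(train, requested, rng)       # generation step
synthetic = filter_to_reference(candidates, test)  # filtering step
print(f"requested {requested} rows, kept {len(synthetic)}")
```

Because the filter discards candidates that fall outside the reference ranges, the number of rows kept is normally smaller than the number requested, so downstream code should check the returned size rather than assume it.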

Quick Start & Requirements

  • Install via pip: pip install tabgan
  • Requires pandas and numpy.
  • Supports continuous and discrete (categorical) columns. Numerical columns are treated as floats, so columns that must hold integers need to be rounded after generation.
  • Example usage and detailed parameter explanations are available in the README.
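Since numerical columns come back as floats, integer-valued columns have to be rounded by the caller. A minimal sketch of that post-generation step is shown here, using plain row dictionaries; the helper `round_integer_columns` and the column names are hypothetical and not part of the library.

```python
def round_integer_columns(rows, int_columns):
    """Return a copy of rows with the named columns rounded to Python ints.

    Note: Python's round() rounds exact halves to the nearest even integer.
    """
    rounded = []
    for row in rows:
        rounded.append({
            col: int(round(val)) if col in int_columns else val
            for col, val in row.items()
        })
    return rounded

# Hypothetical generated rows: "age" should be an integer, "income" stays float.
generated = [
    {"age": 34.7, "income": 51234.56},
    {"age": 21.2, "income": 18000.0},
]
clean = round_integer_columns(generated, {"age"})
print(clean)
```

With a pandas DataFrame, the same idea would typically be written as `df["age"].round().astype("int64")`, assuming the column contains no missing values.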

Highlighted Details

  • Supports GANs, TimeGANs, Diffusion, and LLM-based generation.
  • Includes post-processing and adversarial filtering for data quality.
  • Offers utilities for time-series data generation by extracting date components.
  • Provides a compare_dataframes function for evaluating generated data quality.
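The date-component idea mentioned above can be sketched with the standard library alone. The helpers `split_date` and `join_date` are hypothetical names, not the library's utilities: a date is split into numeric year, month, and day columns that a tabular generator can model, and generated (float) components are rounded and clamped back into a valid date.

```python
import calendar
from datetime import date

def split_date(d: date) -> dict:
    """Turn a date into numeric components a tabular generator can model."""
    return {"year": d.year, "month": d.month, "day": d.day}

def join_date(year: float, month: float, day: float) -> date:
    """Rebuild a date from generated components, clamping month and day to valid values."""
    y = int(round(year))
    m = min(max(int(round(month)), 1), 12)
    last_day = calendar.monthrange(y, m)[1]  # number of days in that month
    d = min(max(int(round(day)), 1), last_day)
    return date(y, m, d)

original = date(2021, 3, 15)
parts = split_date(original)
restored = join_date(**parts)

# A generated row may contain an impossible day such as "30.6 February".
repaired = join_date(2021.2, 2.1, 30.6)
print(parts, restored, repaired)
```

Clamping is one possible design choice; dropping rows with invalid dates instead would avoid piling generated values onto month ends.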

Maintenance & Community

The project is associated with an arXiv article and a Medium post. Further details on community or active maintenance are not explicitly stated in the README.

Licensing & Compatibility

The README does not explicitly state a license. However, the project cites research papers, some of which may have their own licenses. Users should verify licensing for commercial or closed-source use.

Limitations & Caveats

The README mentions that the output data size might be less than requested due to post-processing and adversarial filtering. It also notes that for integer requirements, rounding must be performed outside the library. The project appears to be research-oriented, and its stability for production environments is not detailed.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 1 star in the last 30 days
