tab-ddpm  by yandex-research

Research paper implementation for tabular data generation via diffusion models

created 2 years ago
472 stars

Top 65.5% on sourcepulse

Project Summary

This repository provides the official implementation of "TabDDPM: Modelling Tabular Data with Diffusion Models," an ICML 2023 paper. It generates synthetic tabular data with a diffusion model and targets machine learning and data science researchers and practitioners who need synthetic data generation. Diffusion models may produce higher-quality and more diverse synthetic tabular data than traditional methods.

How It Works

TabDDPM adapts a denoising diffusion probabilistic model to tabular data. A forward diffusion process gradually adds noise to the data, and a learned reverse (denoising) process removes that noise step by step, so new samples can be generated starting from pure noise. The approach aims to capture complex distributions and feature dependencies in tabular datasets more effectively than GAN-based or VAE-based methods.
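The forward noising step can be illustrated with a minimal sketch. This is a generic Gaussian DDPM forward process written with the Python standard library, not code from the repository: the linear schedule, its endpoints, and the function names `linear_beta_schedule`, `cumulative_alphas`, and `noise_row` are all assumptions made for illustration, and the sketch covers numerical columns only.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed schedule, for illustration)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def noise_row(x0, t, alpha_bars, rng):
    """Sample x_t ~ q(x_t | x_0) for one row of standardized numerical features.

    t is a 0-based index into alpha_bars.
    """
    signal = math.sqrt(alpha_bars[t])        # how much of the original row survives
    sigma = math.sqrt(1.0 - alpha_bars[t])   # standard deviation of the added noise
    return [signal * v + sigma * rng.gauss(0.0, 1.0) for v in x0]

rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
row = [0.5, -1.2, 2.0]                       # made-up example row
slightly_noisy = noise_row(row, 0, alpha_bars, rng)
almost_pure_noise = noise_row(row, 999, alpha_bars, rng)
```

At the first step the row is almost unchanged, while at the last step nearly all of the original signal has been replaced by noise. The reverse process is a trained network that learns to undo these steps; it is not shown here.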

Quick Start & Requirements

  • Install: Use Conda for environment management.
    conda create -n tddpm python=3.9.7
    conda activate tddpm
    pip install torch==1.10.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
    pip install -r requirements.txt
    
  • Prerequisites: Python 3.9.7, PyTorch 1.10.1 with CUDA 11.1.
  • Data: Download the datasets from the Dropbox link provided in the README and extract the archive.
  • Run Time: A sample pipeline run takes approximately 7 minutes on an NVIDIA GeForce RTX 2080 Ti.
  • Docs: CONFIG_DESCRIPTION.md for configuration details.
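Because the prerequisites pin exact versions, a quick check after installation can catch a mismatched environment early. The following is a minimal sketch using only the standard library; the helper names `check_pin` and `python_matches` are hypothetical and are not part of the repository.

```python
import sys
from importlib import metadata

def check_pin(package, expected_prefix):
    """Return 'ok', 'mismatch', or 'missing' for an installed package version."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return "missing"
    # A prefix match tolerates local build tags such as "+cu111".
    return "ok" if installed.startswith(expected_prefix) else "mismatch"

def python_matches(expected=(3, 9)):
    """True when the running interpreter has the expected major.minor version."""
    return sys.version_info[:2] == expected

print("python 3.9:", python_matches())
print("torch 1.10.1:", check_pin("torch", "1.10.1"))
```

Run it inside the activated `tddpm` environment; a `missing` or `mismatch` result suggests that the install steps above did not complete as intended.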

Highlighted Details

  • Implements TabDDPM, a diffusion model for tabular data generation.
  • Includes scripts for training, sampling, evaluation, and hyperparameter tuning.
  • Supports evaluation against baselines like SMOTE, CTGAN, TVAE, and CTAB-GAN.
  • Provides scripts for evaluating synthetic data quality using CatBoost or MLP models.
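A common way to score synthetic tabular data with a downstream model is to train the model on synthetic rows and measure its performance on held-out real rows. The sketch below illustrates that idea under loud assumptions: a tiny nearest-centroid classifier stands in for CatBoost or an MLP, the data is made up, and none of the function names come from the repository's evaluation scripts.

```python
def fit_centroids(rows, labels):
    """Compute the mean feature vector per class (stand-in for a real model)."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, row):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def accuracy(centroids, rows, labels):
    hits = sum(predict(centroids, r) == y for r, y in zip(rows, labels))
    return hits / len(labels)

# Made-up data: the model only ever sees synthetic rows during training.
synthetic_X = [[0.1, 0.2], [0.0, -0.1], [2.1, 1.9], [1.8, 2.2]]
synthetic_y = [0, 0, 1, 1]
real_test_X = [[0.2, 0.0], [2.0, 2.0], [1.9, 2.1]]
real_test_y = [0, 1, 1]

model = fit_centroids(synthetic_X, synthetic_y)
score = accuracy(model, real_test_X, real_test_y)
```

A score close to what the same model reaches when trained on real data suggests that the synthetic rows preserve the signal the downstream task needs.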

Maintenance & Community

  • Developed by Yandex Research.
  • No explicit community links (Discord/Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • The README does not explicitly state a license. The data sources are subject to their original licenses, with no additional restrictions imposed by the repository.

Limitations & Caveats

The repository requires a specific older version of PyTorch (1.10.1) with CUDA 11.1, which may pose compatibility challenges with newer hardware or software stacks. The lack of explicit licensing information could be a concern for commercial use.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 29 stars in the last 90 days
