tab-ddpm by yandex-research

Research paper implementation for tabular data generation via diffusion models

Created 3 years ago

522 stars

Top 60.2% on SourcePulse

1 Expert Loves This Project

Omar Sanseviero

DevRel at Google DeepMind

Project Summary

This repository is the official implementation of "TabDDPM: Modelling Tabular Data with Diffusion Models," an ICML 2023 paper. It provides a diffusion-based method for generating synthetic tabular data, aimed at machine learning and data science researchers and practitioners. The stated benefit is synthetic data that may be higher in quality and more diverse than what traditional methods produce.

How It Works

TabDDPM adapts a diffusion probabilistic model to tabular data. A forward diffusion process gradually adds noise to the data, and a learned reverse (denoising) process removes that noise step by step; running the reverse process from pure noise yields new samples. The approach aims to capture complex distributions and feature dependencies in tabular datasets more effectively than GAN-based or VAE-based methods.
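
The forward (noising) half of this process can be illustrated with a minimal sketch. It is not taken from the repository: it assumes the standard Gaussian diffusion formulation with a linear noise schedule, covers numerical features only (the paper handles categorical features with a separate multinomial diffusion), and uses hypothetical function names and feature values chosen for illustration.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Noise variances rising linearly from beta_start to beta_end (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar[t] is the product of (1 - beta_s) over all s <= t."""
    alpha_bar, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar

def forward_diffuse(row, t, alpha_bar, rng):
    """Sample x_t ~ N(sqrt(alpha_bar[t]) * x_0, (1 - alpha_bar[t]) * I) for one row."""
    signal = math.sqrt(alpha_bar[t])       # how much of the original row survives
    sigma = math.sqrt(1.0 - alpha_bar[t])  # standard deviation of the added noise
    return [signal * value + sigma * rng.gauss(0.0, 1.0) for value in row]

rng = random.Random(0)
alpha_bar = cumulative_alphas(linear_beta_schedule(1000))

row = [0.5, -1.2, 2.0]  # hypothetical standardized numerical features
slightly_noised = forward_diffuse(row, 10, alpha_bar, rng)     # early step: mostly signal
almost_pure_noise = forward_diffuse(row, 999, alpha_bar, rng)  # final step: mostly noise
```

Because alpha_bar shrinks toward zero as t grows, late timesteps carry almost no trace of the original row. That is the property the reverse process relies on: it can start from pure Gaussian noise and denoise its way to a new sample.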

Quick Start & Requirements

Install: Use Conda for environment management.

conda create -n tddpm python=3.9.7
conda activate tddpm
pip install torch==1.10.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt

Prerequisites: Python 3.9.7 and PyTorch 1.10.1 built for CUDA 11.1.
Data: Download the datasets from the Dropbox link in the README and extract them.
Run Time: A sample pipeline run takes approximately 7 minutes on an NVIDIA GeForce RTX 2080 Ti.
Docs: See CONFIG_DESCRIPTION.md for configuration details.

Highlighted Details

Implements TabDDPM, a diffusion model for tabular data generation.
Includes scripts for training, sampling, evaluation, and hyperparameter tuning.
Supports evaluation against baselines like SMOTE, CTGAN, TVAE, and CTAB-GAN.
Provides scripts for evaluating synthetic data quality using CatBoost or MLP models.

Maintenance & Community

Developed by Yandex Research.
No explicit community links (Discord/Slack) or roadmap are provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. The data sources are subject to their original licenses, with no additional restrictions imposed by the repository.

Limitations & Caveats

The repository pins an older PyTorch release (1.10.1) built for CUDA 11.1, which may be incompatible with newer GPUs, drivers, or software stacks. Because the README does not state a license, commercial users should confirm the terms before relying on the code.

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Star History

8 stars in the last 30 days