synthetic-data-generator by hitsz-ids

Framework for generating high-quality structured tabular data

Created 2 years ago

2,407 stars

Top 18.7% on SourcePulse

Project Summary

This framework generates high-quality synthetic structured tabular data for data sharing, model training, and system testing. The synthetic data preserves the characteristics of the original data without containing sensitive information. It targets data scientists and engineers who need privacy-preserving data solutions.

How It Works

SDG integrates multiple statistical and LLM-based synthesis algorithms. These include CTGAN, adapted for billion-level data, and a novel LLM-based model for zero-shot generation (synthesis without training data) and off-table feature inference. A Data Processor module handles pre- and post-processing: it converts data types, handles null values, and applies custom transformations, which improves data quality and model compatibility.
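
The pre- and post-processing idea can be illustrated with a minimal sketch. It assumes a reversible, per-column transformation in which nulls are treated as their own category. The class name `CategoryCodeProcessor` and its methods are hypothetical and chosen for illustration; they are not taken from the SDG codebase.

```python
from typing import Dict, Hashable, List, Optional


class CategoryCodeProcessor:
    """Hypothetical reversible processor for a single categorical column."""

    def __init__(self) -> None:
        self._to_code: Dict[Optional[Hashable], int] = {}
        self._to_value: List[Optional[Hashable]] = []

    def fit(self, column: List[Optional[Hashable]]) -> "CategoryCodeProcessor":
        # Assign integer codes in first-seen order; None (a null) gets its own code.
        for value in column:
            if value not in self._to_code:
                self._to_code[value] = len(self._to_value)
                self._to_value.append(value)
        return self

    def convert(self, column: List[Optional[Hashable]]) -> List[int]:
        # Pre-processing: turn raw values into model-friendly integer codes.
        return [self._to_code[value] for value in column]

    def reverse_convert(self, codes: List[int]) -> List[Optional[Hashable]]:
        # Post-processing: map codes produced by a model back to raw values.
        return [self._to_value[code] for code in codes]


raw = ["red", None, "blue", "red"]
processor = CategoryCodeProcessor().fit(raw)
encoded = processor.convert(raw)

# Stand-in for codes that a trained synthesizer might emit.
synthetic_codes = [2, 2, 1, 0]
decoded = processor.reverse_convert(synthetic_codes)
print(encoded, decoded)
```

Keeping the forward and reverse mappings in one object means the synthesis model only ever sees numeric codes, while the output keeps the original value domain, including nulls.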

Quick Start & Requirements

Install via Docker (docker pull idsteam/sdgx:latest) or pip (pip install sdgx).
Local installation from source is recommended.
Demo code available for single-table generation and metrics.
See Colab examples for LLM integration and large-scale CTGAN.
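
As a rough illustration of what a fidelity metric for a single column can look like, the sketch below computes the Jensen-Shannon divergence between the category frequencies of a real and a synthetic column. This is a generic technique and an assumption for illustration; it is not necessarily the metric the project's demo code uses, and the function name `js_divergence` is hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def js_divergence(real: Sequence[Hashable], synthetic: Sequence[Hashable]) -> float:
    """Jensen-Shannon divergence (log base 2) between two categorical columns.

    Returns 0.0 for identical frequency distributions and 1.0 when the
    two columns share no categories.
    """
    if not real or not synthetic:
        raise ValueError("both columns must be non-empty")
    real_counts, synthetic_counts = Counter(real), Counter(synthetic)
    total = 0.0
    for category in set(real_counts) | set(synthetic_counts):
        p = real_counts[category] / len(real)
        q = synthetic_counts[category] / len(synthetic)
        m = (p + q) / 2  # mixture distribution
        if p > 0:
            total += 0.5 * p * math.log2(p / m)
        if q > 0:
            total += 0.5 * q * math.log2(q / m)
    return total


real_column = ["a", "a", "b", "c"]
synthetic_column = ["a", "b", "b", "c"]
score = js_divergence(real_column, synthetic_column)
print(f"JS divergence: {score:.4f}")
```

Lower scores indicate that the synthetic column reproduces the real column's category frequencies more closely.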

Highlighted Details

Supports CTGAN for billion-level data with reduced memory consumption compared to SDV.
Integrates LLM-based models for synthetic data generation without training data and for off-table feature inference.
Data Processor module handles complex data type conversions, null values, and plug-in extensions.
Offers privacy enhancements like differential privacy and anonymization.
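
To make the differential privacy item concrete, here is a minimal sketch of the classic Laplace mechanism applied to a counting query. It is a textbook illustration under stated assumptions, not SDG's implementation; the function names `laplace_noise` and `private_count` are hypothetical.

```python
import random
from typing import Callable, Iterable, Optional


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two independent Exp(1) draws follows Laplace(0, 1).
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_count(
    values: Iterable[float],
    predicate: Callable[[float], bool],
    epsilon: float,
    rng: Optional[random.Random] = None,
) -> float:
    """Return a noisy count of values matching predicate (epsilon-DP sketch)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for value in values if predicate(value))
    # Adding or removing one row changes a count by at most 1.
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


ages = [23, 35, 41, 58, 62]
noisy = private_count(ages, lambda age: age >= 40, epsilon=1.0, rng=random.Random(0))
print(f"noisy count of ages >= 40: {noisy:.2f}")
```

A smaller `epsilon` adds more noise and gives stronger privacy; a larger `epsilon` keeps the answer closer to the true count.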

Maintenance & Community

Initiated by the Institute of Data Security, Harbin Institute of Technology.
Active development with recent updates in November 2024 and May 2024.
Community engagement encouraged via GitHub issues, PRs, and a WeChat group.

Licensing & Compatibility

Licensed under Apache-2.0.
Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The project is under active development, and features such as LLM integration and advanced data processing are recent additions. Users should consult the latest documentation and examples, since recommended usage and capabilities may still change.

Explore Similar Projects

ProX by GAIR-NLP

alpaca-chinese-dataset by carbonz0

OmniSQL by RUCKBReasoning

FlagData by FlagOpen

be_great by tabularis-ai

DataDesigner by NVIDIA-NeMo

mostlyai by mostly-ai

synthcity by vanderschaarlab

persona-hub by tencent-ailab

OpenCoder-llm by OpenCoder-llm

tab-ddpm by yandex-research

curator by bespokelabsai