synthetic-data-generator  by hitsz-ids

Framework for generating high-quality structured tabular data

created 2 years ago
2,374 stars

Top 19.8% on sourcepulse

Project Summary

This framework generates high-quality synthetic tabular data for data sharing, model training, and system testing. The synthetic data preserves the characteristics of the original data while containing no sensitive information. It targets data scientists and engineers who need privacy-preserving data solutions.

How It Works

SDG (Synthetic Data Generator) integrates multiple statistical and LLM-based synthesis algorithms, including CTGAN for billion-level data and a novel LLM-based model that supports zero-shot generation and off-table feature inference. A Data Processor module handles pre- and post-processing for various data types, null values, and custom transformations, which improves data quality and model compatibility.
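The pre- and post-processing idea can be illustrated with a minimal, standard-library sketch. The class below is hypothetical and is not SDG's Data Processor API: the names `NullFlagProcessor`, `fit`, `convert`, and `reverse_convert` are assumptions chosen for illustration. It fills null values before a model sees the data and restores them afterwards.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


class NullFlagProcessor:
    """Hypothetical processor (not SDG's API): fills nulls in one numeric
    column with the column mean and records a flag column so the step
    can be reversed after synthesis."""

    def __init__(self, column: str) -> None:
        self.column = column
        self.flag_column = f"{column}__is_null"
        self.fill_value: float = 0.0

    def fit(self, rows: List[Row]) -> "NullFlagProcessor":
        # Learn the fill value from the non-null entries only.
        observed = [r[self.column] for r in rows if r[self.column] is not None]
        self.fill_value = sum(observed) / len(observed) if observed else 0.0
        return self

    def convert(self, rows: List[Row]) -> List[Row]:
        # Pre-processing: replace nulls and add a 0/1 flag column.
        out = []
        for r in rows:
            missing = r[self.column] is None
            new = dict(r)
            new[self.column] = self.fill_value if missing else r[self.column]
            new[self.flag_column] = 1 if missing else 0
            out.append(new)
        return out

    def reverse_convert(self, rows: List[Row]) -> List[Row]:
        # Post-processing: drop the flag column and restore nulls.
        out = []
        for r in rows:
            new = dict(r)
            if new.pop(self.flag_column) == 1:
                new[self.column] = None
            out.append(new)
        return out


rows = [{"age": 30.0}, {"age": None}, {"age": 50.0}]
processor = NullFlagProcessor("age").fit(rows)
encoded = processor.convert(rows)
restored = processor.reverse_convert(encoded)
```

Keeping the flag as its own column means a synthesis model can also learn how often values are missing, so generated rows can carry nulls at a similar rate once `reverse_convert` is applied.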

Quick Start & Requirements

  • Install via Docker (docker pull idsteam/sdgx:latest) or pip (pip install sdgx).
  • Local installation from source is recommended.
  • Demo code is available for single-table generation and metrics.
  • Colab examples cover LLM integration and large-scale CTGAN.
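After installing with `pip install sdgx`, single-table generation might look like the sketch below. The module paths and the names `CsvConnector`, `CTGANSynthesizerModel`, `Synthesizer`, and `download_demo_data` are recalled from the project's README and are assumptions here; they may differ in the version you install, so check the current documentation. The wrapper `generate_demo_table` is a hypothetical helper, not part of the library.

```python
def generate_demo_table(num_rows: int = 1000, epochs: int = 1):
    """Fit CTGAN on the demo dataset and return sampled synthetic rows.

    The sdgx imports live inside the function so this file can be loaded
    even when sdgx is not installed; the names below are assumptions
    based on the project's README.
    """
    from sdgx.data_connectors.csv_connector import CsvConnector
    from sdgx.models.ml.single_table.ctgan import CTGANSynthesizerModel
    from sdgx.synthesizer import Synthesizer
    from sdgx.utils import download_demo_data

    # Download the demo CSV and wrap it in a data connector.
    dataset_csv = download_demo_data()
    data_connector = CsvConnector(path=dataset_csv)

    # A small epoch count keeps the demo fast; real runs need more training.
    synthesizer = Synthesizer(
        model=CTGANSynthesizerModel(epochs=epochs),
        data_connector=data_connector,
    )
    synthesizer.fit()

    # Draw the requested number of synthetic rows.
    return synthesizer.sample(num_rows)
```

Calling `generate_demo_table()` downloads the demo data, trains briefly, and returns the sampled table, so it needs network access and an installed `sdgx`.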

Highlighted Details

  • Supports CTGAN for billion-level data with lower memory consumption than SDV (Synthetic Data Vault).
  • Integrates LLM-based models for synthetic data generation without training data and for off-table feature inference.
  • Data Processor module handles complex data type conversions, null values, and plug-in extensions.
  • Offers privacy enhancements like differential privacy and anonymization.
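As an illustration of the differential privacy idea mentioned above, the sketch below applies the standard Laplace mechanism to a counting query. It is a generic textbook example, not SDG's implementation; the function names `laplace_noise` and `dp_count` are hypothetical.

```python
import random
from typing import Callable, Iterable, Optional


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(
    values: Iterable[float],
    predicate: Callable[[float], bool],
    epsilon: float,
    rng: Optional[random.Random] = None,
) -> float:
    """Return a noisy count of the values matching `predicate`.

    A counting query has sensitivity 1 (adding or removing one record
    changes the count by at most 1), so Laplace noise with scale
    1 / epsilon gives epsilon-differential privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


ages = [23, 35, 41, 29, 52, 61, 38]
noisy = dp_count(ages, lambda a: a >= 40, epsilon=0.5, rng=random.Random(0))
near_exact = dp_count(ages, lambda a: a >= 40, epsilon=1e6, rng=random.Random(1))
```

A smaller `epsilon` adds more noise and gives stronger privacy; a very large `epsilon` returns a value close to the true count of 3.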

Maintenance & Community

  • Initiated by the Institute of Data Security, Harbin Institute of Technology.
  • Actively developed, with recent updates in May 2024 and November 2024.
  • Community engagement encouraged via GitHub issues, PRs, and a WeChat group.

Licensing & Compatibility

  • Licensed under Apache-2.0.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The project is under active development, and features such as LLM integration and advanced data processing are recent additions that may still change. Consult the latest documentation and examples before relying on them.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 27 stars in the last 90 days
