mostlyai by mostly-ai

Synthetic data SDK for tabular/language data generation

Created 1 year ago
637 stars

Top 52.1% on SourcePulse

Project Summary

This Python SDK provides a toolkit for generating high-fidelity, privacy-safe synthetic tabular and language data. It targets data scientists and engineers needing to create realistic datasets for testing, development, or privacy-preserving analytics, offering both local and cloud-based generation capabilities.

How It Works

The SDK utilizes a modular approach, allowing users to train generators on their data, create synthetic datasets from these generators, and connect to various data sources. It features state-of-the-art models like TabularARGN for tabular data, offering significant efficiency gains, and supports fine-tuning Hugging Face language models or using custom LSTMs for text synthesis.
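The train-then-generate workflow described above can be sketched roughly as follows. This is a minimal sketch, not the project's reference code: the `MostlyAI(local=True)`, `train`, `generate`, and `data()` calls are assumed to match the SDK's published quick-start and should be checked against the current documentation, while `build_generator_config`, `train_and_generate`, and the table name "demo" are hypothetical names introduced only for illustration.

```python
from typing import Any


def build_generator_config(name: str, table_name: str, data: Any) -> dict:
    """Assemble a minimal single-table training config (hypothetical helper)."""
    return {"name": name, "tables": [{"name": table_name, "data": data}]}


def train_and_generate(data: Any, size: int = 1000) -> Any:
    """Train a generator on `data` locally, then sample `size` synthetic rows.

    Assumes the local extras are installed. The import is deferred so this
    module can be loaded even when the SDK is absent.
    """
    from mostlyai.sdk import MostlyAI  # assumed import path

    mostly = MostlyAI(local=True)  # assumed flag for local (non-cloud) mode
    config = build_generator_config("demo generator", "demo", data)
    generator = mostly.train(config=config, start=True, wait=True)
    synthetic = mostly.generate(generator, size=size)
    return synthetic.data()  # assumed to return the synthetic rows
```

Calling `train_and_generate(df)` with a pandas DataFrame would be the typical entry point under these assumptions; the config helper is kept separate so the training setup can be inspected without starting a job.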

Quick Start & Requirements

  • Install with uv's pip interface: uv pip install -U 'mostlyai[local-gpu]' (recommended for GPU) or uv pip install -U mostlyai (client-only).
  • Python 3.10+ required.
  • GPU highly recommended for language models.
  • Optional extras for data connectors: databricks, googlebigquery, hive, mssql, mysql, oracle, postgres, snowflake.
  • Docker support available for isolated environments.
  • Documentation: Usage Examples
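The requirements above can be checked and combined programmatically. The sketch below verifies the Python 3.10+ requirement and composes a requirement string from an install mode plus connector extras. The helper names are hypothetical, and combining a local mode with connector extras in one specifier is an assumption based on standard pip extras syntax, not something the listing states.

```python
import sys

# Extras named in this listing.
LOCAL_MODES = {"local", "local-gpu"}
CONNECTORS = {
    "databricks", "googlebigquery", "hive", "mssql",
    "mysql", "oracle", "postgres", "snowflake",
}


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True when the interpreter meets the Python 3.10+ requirement."""
    return tuple(version_info[:2]) >= (3, 10)


def requirement_spec(mode=None, connectors=()) -> str:
    """Build a requirement string such as 'mostlyai[local,postgres]'."""
    if mode is not None and mode not in LOCAL_MODES:
        raise ValueError(f"unknown install mode: {mode}")
    unknown = set(connectors) - CONNECTORS
    if unknown:
        raise ValueError(f"unknown connector extras: {sorted(unknown)}")
    extras = ([mode] if mode else []) + sorted(connectors)
    return f"mostlyai[{','.join(extras)}]" if extras else "mostlyai"
```

The resulting string can be quoted and passed to `uv pip install -U`; with no mode and no connectors it falls back to the client-only package name.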

Highlighted Details

  • Broad data support: mixed-type, single/multi-table, time-series, geospatial, text.
  • Advanced training: GPU/CPU, differential privacy, progress monitoring, automated QA reports.
  • Flexible sampling: up-sampling, conditional generation, re-balancing, rule-adherence.
  • TabularARGN models offer 1-2 orders of magnitude efficiency improvement.
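Conditional generation, listed above under flexible sampling, is usually driven by a seed table that fixes some columns while the model fills in the rest. The sketch below assumes the SDK's `generate` call accepts a `seed` DataFrame and returns an object with a `data()` method; both are assumptions to verify against the SDK documentation. `build_seed_rows` and `generate_conditionally` are hypothetical helpers.

```python
def build_seed_rows(fixed: dict, n: int) -> list:
    """Return n independent copies of the fixed column values."""
    return [dict(fixed) for _ in range(n)]


def generate_conditionally(mostly, generator, fixed: dict, n: int):
    """Sample n synthetic rows whose seeded columns match `fixed`.

    `mostly` is an initialized SDK client and `generator` a trained generator.
    pandas is imported lazily so the seed helper works without it.
    """
    import pandas as pd

    seed = pd.DataFrame(build_seed_rows(fixed, n))
    # Assumption: `seed=` is the keyword for conditional generation.
    return mostly.generate(generator, seed=seed).data()
```

For example, `build_seed_rows({"sex": "Female"}, 3)` yields three seed rows that pin the hypothetical `sex` column, which can also serve as a simple way to re-balance an under-represented group.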

Maintenance & Community

  • Developed by MOSTLY AI.
  • Citation available via BibTeX.

Licensing & Compatibility

  • Apache 2.0, a fully permissive open-source license.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • Installing local extras (e.g., mostlyai[local]) may downgrade NumPy to 1.26 due to Opacus dependency.
  • A runtime restart is needed after installing local extras in environments like Google Colab.
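To see whether the NumPy downgrade mentioned above has happened in a given environment, the installed version can be read from package metadata without importing NumPy. This is a minimal sketch using only the standard library; the function names are hypothetical, and treating any version below 2.0 as "downgraded" is an assumption that only makes sense if the environment previously had NumPy 2.x.

```python
from importlib import metadata


def major_minor(version: str) -> tuple:
    """Parse '1.26.4' into (1, 26)."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def installed_numpy_version():
    """Return the installed NumPy version string, or None if absent."""
    try:
        return metadata.version("numpy")
    except metadata.PackageNotFoundError:
        return None


def numpy_is_pre_2(version) -> bool:
    """True when a NumPy 1.x release (such as 1.26) is installed."""
    return version is not None and major_minor(version) < (2, 0)
```

Because the check reads metadata rather than the loaded module, it reflects what is on disk after installation; in a notebook such as Google Colab, the already-imported NumPy may still be the old one until the runtime is restarted.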

Health Check

  • Last commit: 14 hours ago
  • Responsiveness: 1 day
  • Pull requests (30d): 22
  • Issues (30d): 3
  • Star history: 20 stars in the last 30 days
