mostlyai  by mostly-ai

Synthetic data SDK for tabular/language data generation

created 1 year ago
616 stars

Top 54.3% on sourcepulse

Project Summary

This Python SDK provides a toolkit for generating high-fidelity, privacy-safe synthetic tabular and language data. It targets data scientists and engineers needing to create realistic datasets for testing, development, or privacy-preserving analytics, offering both local and cloud-based generation capabilities.

How It Works

The SDK is modular: users train generators on their own data, create synthetic datasets from those generators, and connect to a range of data sources. Tabular data is modeled with TabularARGN, which is reported to be considerably more efficient than earlier approaches. Text synthesis is handled either by fine-tuning Hugging Face language models or by training custom LSTMs.
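
The train-then-generate flow described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the SDK's definitive usage: the import path `mostlyai.sdk`, the calls `MostlyAI(local=True)`, `train(config=...)`, `generate(..., size=...)` and `data()`, and the config key names follow the project's published quick start as best recalled and should be checked against the official documentation. The helper names `build_generator_config` and `train_and_sample` are hypothetical.

```python
from typing import Any


def build_generator_config(name: str, table_name: str, data: Any,
                           max_training_minutes: int = 1) -> dict:
    """Assemble a single-table generator config (key names are assumed)."""
    return {
        "name": name,
        "tables": [{
            "name": table_name,
            "data": data,  # e.g. a pandas DataFrame holding the original rows
            "tabular_model_configuration": {
                "max_training_time": max_training_minutes,
            },
        }],
    }


def train_and_sample(data: Any, size: int = 1000):
    """Train a generator locally and return synthetic rows.

    Needs a local install of the SDK; the import is lazy so this module
    still loads where the SDK is not installed.
    """
    from mostlyai.sdk import MostlyAI  # assumed import path

    mostly = MostlyAI(local=True)  # local mode (assumed constructor argument)
    config = build_generator_config("demo", "table", data)
    generator = mostly.train(config=config)  # assumed to wait for training to finish
    synthetic = mostly.generate(generator, size=size)
    return synthetic.data()
```

Keeping the config in a plain dictionary makes it easy to inspect or version before any training is started.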

Quick Start & Requirements

  • Install with uv: uv pip install -U 'mostlyai[local-gpu]' for local generation on a GPU (recommended), or uv pip install -U mostlyai for client-only mode.
  • Python 3.10+ required.
  • GPU highly recommended for language models.
  • Optional extras for data connectors: databricks, googlebigquery, hive, mssql, mysql, oracle, postgres, snowflake.
  • Docker support available for isolated environments.
  • Documentation: Usage Examples
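
The install options listed above can be combined into a single requirement string. The sketch below builds that string and checks the Python version. The extras names are taken from this summary; the helper names `install_spec` and `python_is_supported` are hypothetical, and combining several extras in one bracket is standard pip syntax rather than something this project documents here.

```python
import sys
from typing import Optional, Tuple

# Extras named in this summary.
COMPUTE_EXTRAS = {"local", "local-gpu"}
CONNECTOR_EXTRAS = {
    "databricks", "googlebigquery", "hive", "mssql",
    "mysql", "oracle", "postgres", "snowflake",
}


def install_spec(compute: Optional[str] = None,
                 connectors: Tuple[str, ...] = ()) -> str:
    """Build the requirement string to pass to `uv pip install -U`."""
    if compute is not None and compute not in COMPUTE_EXTRAS:
        raise ValueError(f"unknown compute extra: {compute}")
    unknown = set(connectors) - CONNECTOR_EXTRAS
    if unknown:
        raise ValueError(f"unknown connector extras: {sorted(unknown)}")
    extras = ([compute] if compute else []) + list(connectors)
    return f"mostlyai[{','.join(extras)}]" if extras else "mostlyai"


def python_is_supported(version: Tuple[int, int] = sys.version_info[:2]) -> bool:
    """The SDK requires Python 3.10 or newer."""
    return tuple(version) >= (3, 10)
```

For example, `install_spec("local-gpu", ("postgres",))` yields the string to quote on the command line: uv pip install -U 'mostlyai[local-gpu,postgres]'.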

Highlighted Details

  • Broad data support: mixed-type, single/multi-table, time-series, geospatial, text.
  • Advanced training: GPU/CPU, differential privacy, progress monitoring, automated QA reports.
  • Flexible sampling: up-sampling, conditional generation, re-balancing, rule-adherence.
  • TabularARGN models offer 1-2 orders of magnitude efficiency improvement.
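
Conditional generation is typically driven by a seed: a small table that fixes some column values and leaves the model to fill in the rest. The sketch below shows one way this could look. It assumes that `generate` accepts a `seed` argument holding a pandas DataFrame and that the result exposes `data()`; both should be verified against the SDK documentation. The helper names `build_seed` and `generate_conditionally` and the example column are hypothetical.

```python
from typing import Any, Dict, List


def build_seed(fixed: Dict[str, Any], n_rows: int) -> List[Dict[str, Any]]:
    """Repeat the fixed column values once per requested synthetic row."""
    if n_rows < 1:
        raise ValueError("n_rows must be positive")
    return [dict(fixed) for _ in range(n_rows)]


def generate_conditionally(mostly: Any, generator: Any,
                           fixed: Dict[str, Any], n_rows: int):
    """Ask a trained generator for rows that share the fixed values.

    `mostly` is an already initialized SDK client and `generator` a trained
    generator; pandas is imported lazily because only this function needs it.
    """
    import pandas as pd

    seed = pd.DataFrame(build_seed(fixed, n_rows))
    # Assumed keyword: the seed columns are kept, the remaining ones synthesized.
    return mostly.generate(generator, seed=seed).data()
```

A call such as `build_seed({"sex": "Female"}, 3)` produces three identical seed rows, so the generated dataset would contain three records with that value held constant.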

Maintenance & Community

  • Developed by MOSTLY AI.
  • Citation available via BibTeX.

Licensing & Compatibility

  • Fully permissive open-source license (Apache 2.0).
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • Installing local extras (e.g., mostlyai[local]) may downgrade NumPy to 1.26 due to Opacus dependency.
  • A runtime restart is needed after installing local extras in environments like Google Colab.
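
A quick way to see whether the NumPy downgrade mentioned above has taken place is to read the installed version from package metadata. This is a minimal sketch using only the standard library; the helper names `installed_version` and `is_numpy_1x` are hypothetical.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_numpy_1x(version: Optional[str]) -> bool:
    """True when the version string belongs to the NumPy 1.x series."""
    return version is not None and version.split(".")[0] == "1"


numpy_version = installed_version("numpy")
if is_numpy_1x(numpy_version):
    print(f"NumPy {numpy_version} is installed (1.x series).")
```

In a notebook, the version already loaded in memory can differ from the one on disk until the runtime is restarted, which is why the restart is needed.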

Health Check

  • Last commit: 2 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 31
  • Issues (30d): 4

Star History

  • 148 stars in the last 90 days
