Synthetic data SDK for tabular/language data generation
Top 54.3% on sourcepulse
This Python SDK provides a toolkit for generating high-fidelity, privacy-safe synthetic tabular and language data. It targets data scientists and engineers needing to create realistic datasets for testing, development, or privacy-preserving analytics, offering both local and cloud-based generation capabilities.
How It Works
The SDK is modular: users train a generator on their own data, create synthetic datasets from that generator, and connect to various data sources. For tabular data it provides the TabularARGN model, which is presented as state-of-the-art and significantly more efficient. For text synthesis it supports fine-tuning Hugging Face language models or using custom LSTMs.
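The train-then-generate pattern described above can be illustrated with a deliberately simple stand-in. The sketch below is not the SDK's API and not TabularARGN: the `ToyGenerator` class and its `train` and `generate` methods are hypothetical names, and the "model" just samples each column independently from its observed value frequencies. It ignores correlations between columns and gives no privacy guarantee; it only shows the shape of the workflow.

```python
import random
from collections import Counter


class ToyGenerator:
    """Hypothetical stand-in: samples each column independently
    from the value frequencies seen during training."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._columns = {}  # column name -> (values, weights)

    def train(self, rows):
        """Learn per-column value frequencies from a list of dict rows."""
        if not rows:
            raise ValueError("training data must contain at least one row")
        for name in rows[0]:
            counts = Counter(row[name] for row in rows)
            self._columns[name] = (list(counts.keys()), list(counts.values()))
        return self

    def generate(self, size):
        """Draw `size` synthetic rows, one weighted draw per column."""
        return [
            {
                name: self._rng.choices(values, weights=weights, k=1)[0]
                for name, (values, weights) in self._columns.items()
            }
            for _ in range(size)
        ]


# Step 1: original data (made-up rows, for illustration only)
original = [
    {"age_band": "18-29", "plan": "free"},
    {"age_band": "30-44", "plan": "pro"},
    {"age_band": "30-44", "plan": "free"},
    {"age_band": "45-64", "plan": "pro"},
]

# Step 2: train a generator on the original data
generator = ToyGenerator(seed=42).train(original)

# Step 3: create a synthetic dataset of any size from the generator
synthetic = generator.generate(size=5)
for row in synthetic:
    print(row)
```

The point of separating `train` from `generate` is that a trained generator can be reused to produce datasets of any size without touching the original data again.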
Quick Start & Requirements
Install with uv pip install -U 'mostlyai[local-gpu]' (recommended for GPU) or uv pip install -U mostlyai (client-only).
Data source connectors: databricks, googlebigquery, hive, mssql, mysql, oracle, postgres, snowflake.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Installing the local extra (mostlyai[local]) may downgrade NumPy to 1.26 due to the Opacus dependency.
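One way to see whether an environment has been affected by the NumPy caveat is to check the installed version after installation. The sketch below uses only the Python standard library (`importlib.metadata`); the helper names `parse_major_minor`, `is_numpy_1x`, and `installed_numpy_version` are hypothetical and not part of the SDK.

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a version string such as '1.26.4'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_numpy_1x(version: str) -> bool:
    """True when the version is on the NumPy 1.x line (e.g. 1.26)."""
    return parse_major_minor(version)[0] < 2


def installed_numpy_version() -> Optional[str]:
    """Return the installed NumPy version, or None if NumPy is absent."""
    try:
        return metadata.version("numpy")
    except metadata.PackageNotFoundError:
        return None


version = installed_numpy_version()
if version is None:
    print("NumPy is not installed in this environment")
elif is_numpy_1x(version):
    print(f"NumPy {version} is on the 1.x line; packages requiring NumPy 2 may break")
else:
    print(f"NumPy {version} is on the 2.x line")
```

Running this before and after installing makes a downgrade visible. Using a dedicated virtual environment for local mode keeps the change away from other projects.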