mostly-ai
Synthetic data SDK for tabular/language data generation
Top 49.8% on SourcePulse
This Python SDK provides a toolkit for generating high-fidelity, privacy-safe synthetic tabular and language data. It targets data scientists and engineers needing to create realistic datasets for testing, development, or privacy-preserving analytics, offering both local and cloud-based generation capabilities.
How It Works
The SDK is modular: users train generators on their own data, create synthetic datasets from those generators, and connect to various data sources. For tabular data it features state-of-the-art models such as TabularARGN, which offer significant efficiency gains. For text synthesis it supports fine-tuning Hugging Face language models or using custom LSTMs.
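The train-then-generate workflow can be illustrated with a toy sketch in plain Python. This is not the SDK's API and not TabularARGN: the function names train_toy_generator and generate_rows are hypothetical, and the "model" is a count table in which each column is conditioned only on the column before it, a deliberately crude stand-in for an autoregressive tabular model. It offers no privacy guarantees.

```python
import random
from collections import Counter, defaultdict


def train_toy_generator(rows, columns):
    """Fit one count table per column (hypothetical helper, not the SDK).

    The first column is modeled by its marginal counts; every later
    column is conditioned on the value of the previous column only.
    """
    tables = []
    for i, col in enumerate(columns):
        table = defaultdict(Counter)
        for row in rows:
            context = (row[columns[i - 1]],) if i > 0 else ()
            table[context][row[col]] += 1
        tables.append(dict(table))
    return {"columns": list(columns), "tables": tables}


def generate_rows(generator, size, seed=None):
    """Sample synthetic rows column by column from the fitted tables."""
    rng = random.Random(seed)
    columns = generator["columns"]
    synthetic = []
    for _ in range(size):
        row = {}
        for i, col in enumerate(columns):
            context = (row[columns[i - 1]],) if i > 0 else ()
            counts = generator["tables"][i][context]
            values = list(counts)
            weights = [counts[v] for v in values]
            row[col] = rng.choices(values, weights=weights, k=1)[0]
        synthetic.append(row)
    return synthetic


# Made-up example data, for illustration only.
original = [
    {"city": "Vienna", "plan": "basic"},
    {"city": "Vienna", "plan": "pro"},
    {"city": "Graz", "plan": "basic"},
]
toy_generator = train_toy_generator(original, ["city", "plan"])
for synthetic_row in generate_rows(toy_generator, size=3, seed=0):
    print(synthetic_row)
```

The split into a training step that returns a reusable generator and a separate sampling step mirrors the workflow described above: one trained generator can produce any number of synthetic datasets of different sizes.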
Quick Start & Requirements
Install with uv pip install -U 'mostlyai[local-gpu]' (recommended for GPU) or uv pip install -U mostlyai (client-only).
Data source connectors: databricks, googlebigquery, hive, mssql, mysql, oracle, postgres, snowflake.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The local install (mostlyai[local]) may downgrade NumPy to 1.26 due to the Opacus dependency.
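To see whether an environment is affected by the NumPy caveat, the installed version can be inspected with the standard library. The sketch below is a generic check, not something the SDK provides; the helper names installed_version and is_release_series are hypothetical.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_release_series(version, series="1.26"):
    """True if a version string belongs to the given release series."""
    return version == series or version.startswith(series + ".")


numpy_version = installed_version("numpy")
if numpy_version is None:
    print("NumPy is not installed in this environment")
elif is_release_series(numpy_version):
    print(f"NumPy {numpy_version} is in the 1.26 series")
else:
    print(f"NumPy {numpy_version} is outside the 1.26 series")
```

Running this before and after installation shows whether the NumPy version changed.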