mostlyai by mostly-ai

Synthetic data SDK for tabular/language data generation

Created 2 years ago

699 stars

Top 48.9% on SourcePulse

2 Experts Love This Project

luiscape

Cofounder of Lightning AI

ebursztein

Cybersecurity Lead at Google DeepMind

Project Summary

This Python SDK provides a toolkit for generating high-fidelity, privacy-safe synthetic tabular and language data. It targets data scientists and engineers needing to create realistic datasets for testing, development, or privacy-preserving analytics, offering both local and cloud-based generation capabilities.

How It Works

The SDK is modular: users train generators on their data, create synthetic datasets from those generators, and connect to various data sources. For tabular data it uses the TabularARGN model, for which the project reports an efficiency improvement of 1-2 orders of magnitude. For text synthesis it supports fine-tuning Hugging Face language models or using custom LSTMs.
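The train-then-generate flow described above might look roughly like the following. This is a minimal sketch, not the project's reference usage: it assumes the `mostlyai[local]` extras are installed, and the config key names (`tables`, `tabular_model_configuration`, `max_training_time`) and the helper functions `build_generator_config` and `train_and_generate` are illustrative assumptions that should be checked against the SDK documentation.

```python
from typing import Any, Dict


def build_generator_config(name: str, table_name: str, data: Any,
                           max_training_minutes: int = 2) -> Dict[str, Any]:
    """Assemble a single-table generator config (key names are assumed)."""
    return {
        "name": name,
        "tables": [{
            "name": table_name,
            "data": data,
            "tabular_model_configuration": {
                # Cap training time so a first experiment finishes quickly.
                "max_training_time": max_training_minutes,
            },
        }],
    }


def train_and_generate(df: Any, size: int = 1000) -> Any:
    """Train a generator locally, then sample `size` synthetic rows."""
    # Imported lazily so the config helper works without the SDK installed.
    from mostlyai.sdk import MostlyAI

    mostly = MostlyAI(local=True)  # local mode, no cloud account involved
    config = build_generator_config("demo", "demo_table", df)
    generator = mostly.train(config=config)
    synthetic = mostly.generate(generator, size=size)
    return synthetic.data()
```

Calling `train_and_generate(df)` with a pandas DataFrame would be expected to return a synthetic DataFrame with the same columns. Keeping the config in a separate pure function makes it easy to inspect or version before any training starts.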

Quick Start & Requirements

Install with uv's pip interface: uv pip install -U 'mostlyai[local-gpu]' (recommended for GPU) or uv pip install -U mostlyai (client-only).
Python 3.10+ required.
GPU highly recommended for language models.
Optional extras for data connectors: databricks, googlebigquery, hive, mssql, mysql, oracle, postgres, snowflake.
Docker support available for isolated environments.
Documentation: Usage Examples

Highlighted Details

Broad data support: mixed-type, single/multi-table, time-series, geospatial, text.
Advanced training: GPU/CPU, differential privacy, progress monitoring, automated QA reports.
Flexible sampling: up-sampling, conditional generation, re-balancing, rule-adherence.
TabularARGN models offer 1-2 orders of magnitude efficiency improvement.
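Conditional generation, listed above under flexible sampling, can be sketched as passing a seed table whose columns are held fixed while the remaining columns are synthesized. The sketch below assumes that `generate` accepts a `seed` DataFrame; the helper names `build_seed_rows` and `generate_conditionally` and the example column names are hypothetical.

```python
from typing import Any, Dict, List


def build_seed_rows(fixed_values: Dict[str, Any], count: int) -> List[Dict[str, Any]]:
    """Repeat one set of fixed column values `count` times."""
    return [dict(fixed_values) for _ in range(count)]


def generate_conditionally(mostly: Any, generator: Any,
                           seed_rows: List[Dict[str, Any]]) -> Any:
    """Generate one synthetic row per seed row (assumes a `seed` parameter)."""
    import pandas as pd  # imported lazily; only needed for the SDK call

    seed = pd.DataFrame(seed_rows)
    return mostly.generate(generator, seed=seed).data()


# Example seed: three records that all share the same (hypothetical) values.
seed_rows = build_seed_rows({"age": 63, "sex": "Female"}, count=3)
```

Here `mostly` would be an existing client and `generator` a previously trained generator. Each seed row is copied so that later edits to one row do not affect the others.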

Maintenance & Community

Developed by MOSTLY AI.
Citation available via BibTeX.

Licensing & Compatibility

Fully permissive open-source license.
Compatible with commercial use and closed-source linking.

Limitations & Caveats

Installing local extras (e.g., mostlyai[local]) may downgrade NumPy to 1.26 because of the Opacus dependency.
A runtime restart is needed after installing local extras in environments like Google Colab.
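One way to tell whether a restart is actually needed is to compare the NumPy version recorded in the installed package metadata with the version already loaded in the running process. The following is a small standard-library sketch; the helper names are illustrative and are not part of the SDK.

```python
import sys
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Version recorded in the environment's package metadata, if any."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def loaded_version(module_name: str) -> Optional[str]:
    """Version of a module already imported in this process, if any."""
    module = sys.modules.get(module_name)
    return getattr(module, "__version__", None) if module is not None else None


def needs_restart(installed: Optional[str], loaded: Optional[str]) -> bool:
    """True when an imported module is stale relative to what is installed."""
    return installed is not None and loaded is not None and installed != loaded


if needs_restart(installed_version("numpy"), loaded_version("numpy")):
    print("NumPy changed on disk after it was imported; restart the runtime.")
```

If NumPy has not been imported yet in the session, `loaded_version` returns `None` and no restart is flagged, since the next import picks up the installed version.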

Health Check

Last Commit: 2 days ago
Responsiveness: 1 day
Pull Requests (30d): 6
Issues (30d): 0

Star History

9 stars in the last 30 days

Explore Similar Projects

Starred by Wing Lian (Founder of Axolotl AI).

awesome-synthetic-datasets by davanstrien

Curated list of synthetic text/vision datasets and generation tools

Created 1 year ago

Updated 3 days ago

alpaca-chinese-dataset by carbonz0

Chinese instruction fine-tuning dataset

Created 2 years ago

Updated 2 years ago

Starred by Wing Lian (Founder of Axolotl AI).

loong by camel-ai

Synthetic data generation project using LLM agents

Created 9 months ago

Updated 2 days ago

OmniSQL by RUCKBReasoning

Text-to-SQL models and dataset for cross-domain applications

Created 10 months ago

Updated 4 months ago

Starred by Luca Soldaini (Research Scientist at Ai2), Jeff Hammerbacher (Cofounder of Cloudera), and 4 more.

DataDreamer by datadreamer-dev

Python library for synthetic data generation and training workflows

Created 2 years ago

Updated 11 months ago

be_great by tabularis-ai

Framework for synthetic tabular data generation (research paper)

Created 3 years ago

Updated 1 month ago

synthetic-data-generator by argilla-io

Synthetic data generator for language models

Created 1 year ago

Updated 3 months ago

Starred by Casper Hansen (Author of AutoAWQ) and Maxime Labonne (Head of Post-Training at Liquid AI).

OpenCoder-llm by OpenCoder-llm

Open code LLM family (1.5B/8B) for English and Chinese

Created 1 year ago

Updated 1 year ago

persona-hub by tencent-ailab

Synthetic data creation research paper using 1B personas

Created 1 year ago

Updated 10 months ago

Starred by Omar Sanseviero (DevRel at Google DeepMind).

tab-ddpm by yandex-research

Research paper implementation for tabular data generation via diffusion models

Created 3 years ago

Updated 1 year ago

Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Brendan Falk (Cofounder of Fig), and 1 more.

aitextgen by minimaxir

Python tool for text-based AI training and generation

Created 6 years ago

Updated 2 years ago

synthetic-data-generator by hitsz-ids

Framework for generating high-quality structured tabular data

Created 2 years ago

Updated 3 weeks ago
