DataDesigner by NVIDIA-NeMo

High-quality synthetic data generation library

Created 4 months ago

744 stars

Top 46.6% on SourcePulse

Project Summary

NVIDIA-NeMo/DataDesigner is a library designed for generating high-quality synthetic datasets, addressing limitations of basic LLM prompting. It targets engineers and researchers needing production-grade data, offering a flexible framework to create diverse, statistically sound, and validated datasets from scratch or seed data.

How It Works

The framework combines three data sources: statistical samplers, large language models (LLMs), and existing seed datasets. Generation is dependency-aware, which allows precise control over relationships between data fields. Quality is enforced through built-in validation, with support for Python, SQL, and custom local or remote validators, plus LLM-as-a-judge scoring of outputs.
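To make the idea of dependency-aware generation concrete, here is a minimal sketch in plain Python. It does not use the Data Designer API. The column specs, the function names, and the "summary" column (a stand-in for an LLM-generated field) are all hypothetical. It only illustrates one way fields could be generated in an order that respects their dependencies.

```python
import random
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical column specs: name -> (dependencies, generator function).
# Each generator receives the partially built row and a seeded RNG.
CURRENCY_BY_COUNTRY = {"US": "USD", "DE": "EUR", "JP": "JPY"}

COLUMNS = {
    # Sampler-style column with no dependencies.
    "country": ((), lambda row, rng: rng.choice(sorted(CURRENCY_BY_COUNTRY))),
    # Depends on "country", so it must be generated after it.
    "currency": (("country",), lambda row, rng: CURRENCY_BY_COUNTRY[row["country"]]),
    # Another independent sampler-style column.
    "amount": ((), lambda row, rng: round(rng.uniform(5.0, 500.0), 2)),
    # Stand-in for an LLM-generated column that reads earlier fields.
    "summary": (
        ("amount", "currency"),
        lambda row, rng: f"Paid {row['amount']} {row['currency']}",
    ),
}


def generation_order(columns):
    """Return column names ordered so that dependencies come first."""
    graph = {name: set(deps) for name, (deps, _) in columns.items()}
    return list(TopologicalSorter(graph).static_order())


def generate_rows(columns, n, seed=0):
    """Generate n rows, filling each column only after its dependencies."""
    rng = random.Random(seed)
    order = generation_order(columns)
    rows = []
    for _ in range(n):
        row = {}
        for name in order:
            _, generate = columns[name]
            row[name] = generate(row, rng)
        rows.append(row)
    return rows
```

Calling generate_rows(COLUMNS, 3) yields rows in which "currency" always matches "country", because the topological order guarantees "country" is filled first.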

Quick Start & Requirements

Install with pip (pip install data-designer), or install from source by cloning the repository and running make install. A required prerequisite is an API key for either NVIDIA (NVIDIA_API_KEY) or OpenAI (OPENAI_API_KEY). Detailed documentation, including a Quick Start Guide and Tutorial Notebooks, is available through the GitHub repository at https://github.com/NVIDIA-NeMo/DataDesigner.
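Before running anything, it can help to confirm that one of the two keys is actually visible to Python. The sketch below assumes the keys are supplied as environment variables with the names given above. The helper configured_providers is hypothetical and is not part of the library.

```python
import os

# Variable names taken from the text above; reading them from the
# environment is an assumption of this sketch.
SUPPORTED_KEYS = ("NVIDIA_API_KEY", "OPENAI_API_KEY")


def configured_providers(env=None):
    """Return the supported key names that are set to a non-empty value."""
    env = os.environ if env is None else env
    return [name for name in SUPPORTED_KEYS if env.get(name)]


if __name__ == "__main__":
    found = configured_providers()
    if found:
        print("API key(s) found:", ", ".join(found))
    else:
        print("No API key set; export one of:", ", ".join(SUPPORTED_KEYS))
```

Passing a plain dictionary as env makes the check easy to test without touching the real environment.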

Highlighted Details

Supports data generation via statistical samplers, LLMs, or seed datasets.
Enables control over inter-field relationships through dependency-aware generation.
Features built-in validation using Python, SQL, and custom local/remote validators.
Incorporates LLM-as-a-judge for scoring generated data quality.
Includes a preview mode for rapid iteration before full-scale generation.
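The validation and preview features listed above can be pictured with a small, library-independent sketch. Everything here is hypothetical: the record generator, the validator names, and the preview function are illustrative only and do not reflect the actual Data Designer API. The idea is to generate a small sample, run Python checks over it, and inspect failures before committing to a full run.

```python
import random


def generate_record(rng):
    """Stand-in for any generator (sampler, LLM call, or seed-data lookup)."""
    return {
        "age": rng.randint(-5, 100),  # deliberately allows invalid values
        "email": f"user{rng.randint(1, 999)}@example.com",
    }


# Hypothetical Python validators: each maps a record to True (valid) or False.
VALIDATORS = {
    "age_in_range": lambda r: 0 <= r["age"] <= 120,
    "email_has_at": lambda r: "@" in r["email"],
}


def validate(record):
    """Return the names of the validators that the record fails."""
    return [name for name, check in VALIDATORS.items() if not check(record)]


def preview(n=10, seed=0):
    """Generate a small sample and map row index -> failed validator names."""
    rng = random.Random(seed)
    records = [generate_record(rng) for _ in range(n)]
    failures = {}
    for index, record in enumerate(records):
        failed = validate(record)
        if failed:
            failures[index] = failed
    return records, failures
```

A small n keeps iteration cheap. Once the failure map from preview is empty or acceptably small, the same generator and validators could be reused for a larger run.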

Maintenance & Community

The project is maintained by "The NeMo Data Designer Team." Bug reports and feature requests are handled through GitHub Issues. The README does not list community channels such as Discord or Slack, or a public roadmap.

Licensing & Compatibility

Data Designer is released under the permissive Apache License 2.0. The license permits commercial use, modification, and integration into closed-source projects, provided the license text and attribution notices are retained.

Limitations & Caveats

The provided README does not explicitly detail limitations, known bugs, or alpha status. However, the reliance on external LLM APIs for certain generation tasks implies potential costs and a dependency on the availability and performance of those services.

Health Check

Last Commit

1 day ago

Responsiveness

Inactive

Star History

86 stars in the last 30 days