DataDesigner by NVIDIA-NeMo

High-quality synthetic data generation library

Created 2 months ago
620 stars

Top 53.3% on SourcePulse

Project Summary

Summary

NVIDIA-NeMo/DataDesigner is a library designed for generating high-quality synthetic datasets, addressing limitations of basic LLM prompting. It targets engineers and researchers needing production-grade data, offering a flexible framework to create diverse, statistically sound, and validated datasets from scratch or seed data.

How It Works

The framework takes a hybrid approach, combining statistical samplers, large language models (LLMs), and existing seed datasets as sources for data generation. A key design choice is dependency-aware generation, which allows precise control over relationships between data fields. Quality is checked through built-in validation mechanisms (Python, SQL, and custom local or remote validators) and through LLM-as-a-judge scoring of outputs.
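To make the idea of dependency-aware generation concrete, here is a minimal, library-independent sketch. It does not use Data Designer's API; the column names, the `COLUMNS` structure, and the helper functions are hypothetical, and the "summary" column uses a string template as a stand-in for an LLM call. The only point illustrated is that fields are generated in an order where every field's dependencies are already filled in.

```python
import random
from graphlib import TopologicalSorter

# Hypothetical column specs: name -> (dependencies, generator function).
# Each generator receives the partially built row and a random.Random instance.
COLUMNS = {
    "country": ((), lambda row, rng: rng.choice(["US", "DE", "JP"])),
    "currency": (
        ("country",),
        lambda row, rng: {"US": "USD", "DE": "EUR", "JP": "JPY"}[row["country"]],
    ),
    "amount": ((), lambda row, rng: round(rng.uniform(5, 500), 2)),
    "summary": (
        ("amount", "currency"),
        # Stand-in for an LLM call: a template filled from upstream fields.
        lambda row, rng: f"Payment of {row['amount']} {row['currency']}",
    ),
}


def generation_order(columns):
    """Return column names ordered so that dependencies come first."""
    graph = {name: set(deps) for name, (deps, _) in columns.items()}
    return list(TopologicalSorter(graph).static_order())


def generate_rows(columns, n, seed=0):
    """Generate n rows, filling each column after the columns it depends on."""
    rng = random.Random(seed)
    order = generation_order(columns)
    rows = []
    for _ in range(n):
        row = {}
        for name in order:
            _, generator = columns[name]
            row[name] = generator(row, rng)
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in generate_rows(COLUMNS, 3):
        print(row)
```

In this sketch a cyclic dependency would raise `graphlib.CycleError` when the order is computed, so an invalid schema fails before any row is generated.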

Quick Start & Requirements

Installation is straightforward via pip: pip install data-designer. Alternatively, users can install from source by cloning the repository and running make install. A critical prerequisite is setting an API key for either NVIDIA (NVIDIA_API_KEY) or OpenAI (OPENAI_API_KEY). The project offers detailed documentation, including a Quick Start Guide and Tutorial Notebooks, accessible via the GitHub repository at https://github.com/NVIDIA-NeMo/DataDesigner.
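Before running any generation, it can help to confirm that one of the two API key variables named above is actually set. The following is a small standard-library sketch; the helper name `configured_providers` is hypothetical and is not part of the data-designer package.

```python
import os

# Environment variable names taken from the text above.
SUPPORTED_KEYS = ("NVIDIA_API_KEY", "OPENAI_API_KEY")


def configured_providers(env=None):
    """Return the supported API key variables that are set and non-empty."""
    env = os.environ if env is None else env
    return [name for name in SUPPORTED_KEYS if env.get(name, "").strip()]


if __name__ == "__main__":
    found = configured_providers()
    if found:
        print("API key(s) found:", ", ".join(found))
    else:
        print("No API key set; export one of:", ", ".join(SUPPORTED_KEYS))
```

The check only verifies that a variable is present; it does not validate the key against the provider.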

Highlighted Details

  • Supports data generation via statistical samplers, LLMs, or seed datasets.
  • Enables control over inter-field relationships through dependency-aware generation.
  • Features built-in validation using Python, SQL, and custom local/remote validators.
  • Incorporates LLM-as-a-judge for scoring generated data quality.
  • Includes a preview mode for rapid iteration before full-scale generation.
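
The validation and preview items above can be pictured with a short, library-independent sketch: generate a handful of records, run each through a set of Python validators, and inspect the pass rate before committing to a full run. Everything here is hypothetical (the record fields, the `VALIDATORS` table, and the `preview` and `pass_rate` helpers); it is not Data Designer's API, and the generator is a random stand-in for samplers or LLM calls.

```python
import random


def generate_record(rng):
    """Stand-in generator; a real pipeline would call samplers or an LLM."""
    return {
        "age": rng.randint(10, 90),
        "email": rng.choice(["a@example.com", "not-an-email"]),
    }


# Hypothetical validators: each maps a record to True (valid) or False.
VALIDATORS = {
    "adult": lambda r: r["age"] >= 18,
    "email_format": lambda r: "@" in r["email"],
}


def preview(n=10, seed=0):
    """Generate a small sample and attach per-validator results to each record."""
    rng = random.Random(seed)
    results = []
    for _ in range(n):
        record = generate_record(rng)
        checks = {name: check(record) for name, check in VALIDATORS.items()}
        results.append(
            {"record": record, "checks": checks, "valid": all(checks.values())}
        )
    return results


def pass_rate(results):
    """Fraction of records that passed every validator (0.0 for an empty list)."""
    return sum(r["valid"] for r in results) / len(results) if results else 0.0


if __name__ == "__main__":
    sample = preview(n=10)
    print(f"pass rate: {pass_rate(sample):.0%}")
    for item in sample:
        if not item["valid"]:
            print("failed:", item["record"], item["checks"])
```

Keeping per-validator results alongside each record makes it easy to see which rule is rejecting data during a preview, rather than only the overall pass rate.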

Maintenance & Community

The project is maintained by "The NeMo Data Designer Team." Community engagement is facilitated through GitHub Issues for bug reporting and feature requests. Specific links to community channels like Discord or Slack, or a public roadmap, are not detailed in the provided README.

Licensing & Compatibility

Data Designer is released under the permissive Apache License 2.0. This license allows commercial use and integration into closed-source projects, provided the license text and attribution notices are retained.

Limitations & Caveats

The provided README does not explicitly detail limitations, known bugs, or alpha status. However, the reliance on external LLM APIs for certain generation tasks implies potential costs and a dependency on the availability and performance of those services.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: Inactive
  • Pull requests (30d): 42
  • Issues (30d): 33
  • Star history: 199 stars in the last 30 days
