DataDreamer by datadreamer-dev

Python library for synthetic data generation and training workflows

Created 2 years ago

1,086 stars

Top 35.0% on SourcePulse

6 Experts Love This Project

Luca Soldaini

Research Scientist at Ai2

Jeff Hammerbacher

Cofounder of Cloudera

Elvis Saravia

Founder of DAIR.AI

Junyang Lin

Core Maintainer at Alibaba Qwen

and 2 more!

Project Summary

DataDreamer is a Python library for creating and executing complex LLM workflows, focusing on synthetic data generation, model training, and alignment. It targets researchers and practitioners needing reproducible, efficient, and accessible tools for LLM development, enabling the creation of custom datasets and fine-tuned models.

How It Works

DataDreamer facilitates multi-step prompting workflows with various LLMs, generates synthetic datasets for task augmentation, and supports model training techniques like fine-tuning, instruction-tuning, and distillation. Its design emphasizes simplicity, research-grade correctness, efficiency through caching and parameter-efficient methods (e.g., LoRA), and reproducibility via shareable workflows.
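
To make the idea of a multi-step prompting workflow concrete, here is a minimal sketch in plain Python. It does not use DataDreamer's API: fake_llm, generate_topics, and run_workflow are hypothetical names introduced only for illustration, and the stub model simply echoes its prompt. The point is the shape of the pipeline, where the output of one prompting step feeds the next and the resulting (input, output) pairs form a synthetic dataset that could later be used to train a smaller model.

```python
from typing import Callable, Dict, List


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; it only echoes the prompt."""
    return f"[response to: {prompt}]"


def generate_topics(n: int) -> List[str]:
    """Step 0: produce seed inputs (hard-coded here; assumed for illustration)."""
    return [f"topic {i}" for i in range(n)]


def run_workflow(llm: Callable[[str], str], n: int) -> List[Dict[str, str]]:
    """Chain two prompting steps and collect (input, output) training pairs."""
    rows = []
    for topic in generate_topics(n):
        # Step 1: generate a synthetic document from the seed topic.
        abstract = llm(f"Write an abstract about {topic}.")
        # Step 2: feed the step 1 output into a second prompt.
        summary = llm(f"Summarize in one sentence: {abstract}")
        rows.append({"input": abstract, "output": summary})
    return rows


if __name__ == "__main__":
    for row in run_workflow(fake_llm, 2):
        print(row)
```

Swapping fake_llm for a real model client would leave run_workflow unchanged, which is the main reason to pass the model in as a parameter.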

Quick Start & Requirements

Install via pip: pip3 install datadreamer.dev (the PyPI package name is datadreamer.dev; the Python import name is datadreamer)
See demo: demo.py
Additional demos and recipes: Quick Tour

Highlighted Details

Supports both open-source and API-based LLMs.
Features aggressive caching and resumability for efficiency.
Enables automatic generation of data cards and model cards.
Built with a focus on correctness, best practices, and reproducibility.
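
The caching and resumability item above can be illustrated with a small, generic sketch. This is not how DataDreamer implements its cache (the source does not describe the mechanism); cached_step is a hypothetical helper that assumes each step's result is JSON-serializable and keys the cache on a hash of the step name and its arguments. Re-running an interrupted workflow then skips every step whose result is already on disk.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Callable, Dict


def cached_step(
    cache_dir: Path,
    name: str,
    args: Dict[str, Any],
    compute: Callable[..., Any],
) -> Any:
    """Run compute(**args) once per (name, args) pair and reuse the saved result.

    Hypothetical helper for illustration; assumes args and the result are
    JSON-serializable.
    """
    # Derive a stable cache key from the step name and its arguments.
    payload = json.dumps({"step": name, "args": args}, sort_keys=True)
    key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"

    # Cache hit: load the earlier result instead of recomputing.
    if path.exists():
        return json.loads(path.read_text())

    # Cache miss: compute, persist, and return.
    result = compute(**args)
    path.write_text(json.dumps(result))
    return result
```

Changing any argument changes the key, so a modified step is recomputed while untouched steps are served from disk.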

Maintenance & Community

Described as under active development, with contributions to Hugging Face and LiteLLM; note that the Health Check reports the last commit as 11 months ago.
Contact via email (ajayp@upenn.edu) or Discord for questions and feedback.

Licensing & Compatibility

Released under the MIT License.
Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

The library is research-grade and may require familiarity with LLM concepts for advanced usage. Specific hardware requirements for training or running large models are not detailed in the README.

Health Check

Last Commit

11 months ago

Responsiveness

Inactive

Star History

5 stars in the last 30 days