DataDreamer by datadreamer-dev

Python library for synthetic data generation and training workflows

Created 2 years ago
1,086 stars

Top 35.0% on SourcePulse

Project Summary

DataDreamer is a Python library for creating and executing complex LLM workflows, focusing on synthetic data generation, model training, and alignment. It targets researchers and practitioners needing reproducible, efficient, and accessible tools for LLM development, enabling the creation of custom datasets and fine-tuned models.

How It Works

DataDreamer facilitates multi-step prompting workflows with various LLMs, generates synthetic datasets for task augmentation, and supports model training techniques like fine-tuning, instruction-tuning, and distillation. Its design emphasizes simplicity, research-grade correctness, efficiency through caching and parameter-efficient methods (e.g., LoRA), and reproducibility via shareable workflows.
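The caching and resumability idea can be illustrated with a minimal, library-independent sketch. It does not use DataDreamer's API: cached_step, fake_llm, and run_workflow are hypothetical names, and the on-disk JSON layout is an assumption made purely for illustration. Each step's result is stored under a key derived from the step name and its inputs, so re-running the workflow skips any step whose result already exists.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cached_step(cache_dir, name, inputs, fn):
    """Run fn(inputs) once; later calls with the same name and inputs load from disk.

    Returns (result, cache_hit). Hypothetical helper, not part of DataDreamer.
    """
    # Key the cache entry on the step name plus a stable serialization of its inputs.
    payload = json.dumps([name, inputs], sort_keys=True).encode()
    key = hashlib.sha256(payload).hexdigest()[:16]
    path = Path(cache_dir) / f"{name}-{key}.json"
    if path.exists():
        return json.loads(path.read_text()), True  # resume from the cached result
    result = fn(inputs)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result))
    return result, False


def fake_llm(prompts):
    # Stand-in for a real LLM call so the sketch runs offline.
    return [f"response to: {p}" for p in prompts]


def run_workflow(cache_dir):
    """Two-step workflow: build prompts, then 'generate' one output per prompt."""
    topics = ["caching", "LoRA"]
    prompts, _ = cached_step(
        cache_dir,
        "make_prompts",
        topics,
        lambda ts: [f"Write one sentence about {t}." for t in ts],
    )
    outputs, hit = cached_step(cache_dir, "generate", prompts, fake_llm)
    return outputs, hit


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    print(run_workflow(workdir))  # first run computes both steps
    print(run_workflow(workdir))  # second run loads both steps from disk
```

Keying on the inputs as well as the step name means that changing an upstream step invalidates the downstream cache entries automatically, which is the property that makes a workflow safe to resume after an interruption.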

Quick Start & Requirements

  • Install via pip: pip3 install datadreamer.dev
  • See demo: demo.py
  • Additional demos and recipes: Quick Tour

Highlighted Details

  • Supports both open-source and API-based LLMs.
  • Features aggressive caching and resumability for efficiency.
  • Enables automatic generation of data cards and model cards.
  • Built with a focus on correctness, best practices, and reproducibility.

Maintenance & Community

  • Active development with contributions to Hugging Face and LiteLLM.
  • Contact via email (ajayp@upenn.edu) or Discord for questions and feedback.

Licensing & Compatibility

  • Released under the MIT License.
  • Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

The library is research-grade and may require familiarity with LLM concepts for advanced usage. Specific hardware requirements for training or running large models are not detailed in the README.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 5 stars in the last 30 days

Explore Similar Projects

Starred by Ross Taylor (Cofounder of General Reasoning; Cocreator of Papers with Code), Yaowei Zheng (Author of LLaMA-Factory), and 3 more.

curator by bespokelabsai

0.2%
2k
Synthetic data curation tool for post-training and structured data extraction
Created 1 year ago
Updated 6 days ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Yaowei Zheng (Author of LLaMA-Factory), and 1 more.

Kiln by Kiln-AI

0.3%
5k
AI prototyping and dataset collaboration tool
Created 1 year ago
Updated 19 hours ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Didier Lopes (Founder of OpenBB), and 3 more.

instructlab by instructlab

0.3%
1k
CLI tool for LLM alignment tuning via synthetic data
Created 1 year ago
Updated 6 days ago