DataDreamer by datadreamer-dev

Python library for synthetic data generation and training workflows

created 2 years ago
1,040 stars

Top 36.8% on sourcepulse

Project Summary

DataDreamer is a Python library for building and running multi-step LLM workflows, with a focus on synthetic data generation, model training, and alignment. It is aimed at researchers and practitioners who need reproducible, efficient, and accessible tooling for LLM development, and it lets them create custom datasets and fine-tuned models.

How It Works

DataDreamer facilitates multi-step prompting workflows with various LLMs, generates synthetic datasets for task augmentation, and supports model training techniques like fine-tuning, instruction-tuning, and distillation. Its design emphasizes simplicity, research-grade correctness, efficiency through caching and parameter-efficient methods (e.g., LoRA), and reproducibility via shareable workflows.
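The caching and resumability idea described above can be illustrated with a minimal standard-library sketch. This is not DataDreamer's API or implementation: `cached_step`, `fake_generate`, `fake_label`, and `run_workflow` are hypothetical names, and the "LLM" steps are plain Python stand-ins. The sketch only shows the general pattern of saving each step's output to disk, keyed by the step name and its inputs, so that a rerun of the workflow reloads results instead of recomputing them.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cached_step(cache_dir, name, fn, inputs):
    """Run fn(inputs) once; later calls with the same name and inputs reload the saved result."""
    # Key the cache entry on the step name plus its (JSON-serializable) inputs.
    key_src = json.dumps({"step": name, "inputs": inputs}, sort_keys=True)
    key = hashlib.sha256(key_src.encode("utf-8")).hexdigest()[:16]
    path = Path(cache_dir) / f"{name}-{key}.json"
    if path.exists():
        # Cache hit: skip the (potentially expensive) step entirely.
        return json.loads(path.read_text(encoding="utf-8"))
    result = fn(inputs)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result


# Call counters make it visible how often each step actually executes.
calls = {"generate": 0, "label": 0}


def fake_generate(topics):
    """Hypothetical stand-in for an LLM generation step."""
    calls["generate"] += 1
    return [f"A short abstract about {t}." for t in topics]


def fake_label(texts):
    """Hypothetical stand-in for a second step that consumes the first step's output."""
    calls["label"] += 1
    return [{"text": t, "n_words": len(t.split())} for t in texts]


def run_workflow(cache_dir, topics):
    """Two-step workflow: generate texts, then annotate them."""
    abstracts = cached_step(cache_dir, "generate", fake_generate, topics)
    return cached_step(cache_dir, "label", fake_label, abstracts)


workdir = tempfile.mkdtemp()
first = run_workflow(workdir, ["caching", "distillation"])
second = run_workflow(workdir, ["caching", "distillation"])  # served from disk
```

After both runs, `calls` shows that each step executed only once, and `second` equals `first`. Changing the input topics would produce a different cache key and trigger a fresh run of both steps.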

Quick Start & Requirements

  • Install via pip: pip3 install datadreamer.dev
  • See demo: demo.py
  • Additional demos and recipes: Quick Tour

Highlighted Details

  • Supports both open-source and API-based LLMs.
  • Features aggressive caching and resumability for efficiency.
  • Enables automatic generation of data cards and model cards.
  • Built with a focus on correctness, best practices, and reproducibility.

Maintenance & Community

  • Active development with contributions to Hugging Face and LiteLLM.
  • Contact via email (ajayp@upenn.edu) or Discord for questions and feedback.

Licensing & Compatibility

  • Released under the MIT License.
  • Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

The library is research-grade and may require familiarity with LLM concepts for advanced usage. Specific hardware requirements for training or running large models are not detailed in the README.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

  • 30 stars in the last 90 days