deepfabric  by lukehinds

Python library for synthetic dataset generation using LLMs

Created 10 months ago
449 stars

Top 67.0% on SourcePulse

GitHubView on GitHub
Project Summary

Promptwright is a Python library for generating large synthetic datasets using Large Language Models (LLMs), supporting both local LLMs (via Ollama, VLLM) and major cloud providers (OpenAI, Anthropic, etc.). It offers flexible configuration via YAML or Python code, enabling users to define complex data generation tasks with customizable prompts, instructions, and model parameters, ultimately facilitating the creation of high-quality training data for AI models.

How It Works

Promptwright leverages LiteLLM for broad LLM provider compatibility, allowing seamless switching between different models and APIs. It supports structured data generation through a "topic tree" approach, where prompts branch out to create hierarchical datasets, and a "data engine" for more direct instruction-based generation. Tasks are defined in YAML or programmatically, specifying model details, prompts, and output formats, with options to include system messages and push results directly to Hugging Face Hub.

Quick Start & Requirements

  • Install via pip: pip install promptwright
  • Development install: git clone https://github.com/StacklokLabs/promptwright.git && cd promptwright && poetry install
  • Prerequisites: Python 3.11+, Poetry (for development).
  • Optional: Hugging Face account and API token for uploads.
  • Usage: promptwright start config.yaml
  • Documentation: https://github.com/StacklokLabs/promptwright

Highlighted Details

  • Supports multiple LLM providers and local models via LiteLLM.
  • YAML configuration for defining generation tasks and Hugging Face Hub integration.
  • Command-line interface for running tasks and overriding configuration.
  • Automatic dataset card creation and tagging for Hugging Face Hub uploads.
  • Option to include or exclude system messages in the generated dataset.

Maintenance & Community

  • Developed by Stacklok Labs.
  • Contribution guidelines are available; open issues or pull requests are welcome.

Licensing & Compatibility

  • Licensed under the Apache 2.0 License.
  • Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

The quality of generated data is highly dependent on prompt quality and the chosen LLM. The library does not guarantee data quality and acknowledges that LLMs can produce unpredictable or inappropriate content, or fail to adhere to formatting instructions, though error handling for invalid JSON is included.

Health Check
Last Commit

1 day ago

Responsiveness

1+ week

Pull Requests (30d)
71
Issues (30d)
8
Star History
13 stars in the last 30 days

Explore Similar Projects

Starred by Bryan Helmig Bryan Helmig(Cofounder of Zapier), Will Brown Will Brown(Research Lead at Prime Intellect), and
1 more.

ReCall by Agent-RL

1.2%
1k
RL framework for LLM tool use
Created 6 months ago
Updated 4 months ago
Starred by Andrej Karpathy Andrej Karpathy(Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Stefan van der Walt Stefan van der Walt(Core Contributor to scientific Python ecosystem), and
12 more.

litgpt by Lightning-AI

0.1%
13k
LLM SDK for pretraining, finetuning, and deploying 20+ high-performance LLMs
Created 2 years ago
Updated 6 days ago
Feedback? Help us improve.