deepfabric by always-further

Python library for synthetic dataset generation using LLMs

Created 1 year ago

777 stars

Top 45.1% on SourcePulse

View on GitHub

2 Experts Love This Project

Michael Han

Cofounder of Unsloth

Elie Bursztein

Cybersecurity Lead at Google DeepMind

Project Summary

Promptwright is a Python library for generating large synthetic datasets using Large Language Models (LLMs), supporting both local LLMs (via Ollama, VLLM) and major cloud providers (OpenAI, Anthropic, etc.). It offers flexible configuration via YAML or Python code, enabling users to define complex data generation tasks with customizable prompts, instructions, and model parameters, ultimately facilitating the creation of high-quality training data for AI models.

How It Works

Promptwright leverages LiteLLM for broad LLM provider compatibility, allowing seamless switching between different models and APIs. It supports structured data generation through a "topic tree" approach, where prompts branch out to create hierarchical datasets, and a "data engine" for more direct instruction-based generation. Tasks are defined in YAML or programmatically, specifying model details, prompts, and output formats, with options to include system messages and push results directly to Hugging Face Hub.

Quick Start & Requirements

Install via pip: pip install promptwright
Development install: git clone https://github.com/StacklokLabs/promptwright.git && cd promptwright && poetry install
Prerequisites: Python 3.11+, Poetry (for development).
Optional: Hugging Face account and API token for uploads.
Usage: promptwright start config.yaml
Documentation: https://github.com/StacklokLabs/promptwright

Highlighted Details

Supports multiple LLM providers and local models via LiteLLM.
YAML configuration for defining generation tasks and Hugging Face Hub integration.
Command-line interface for running tasks and overriding configuration.
Automatic dataset card creation and tagging for Hugging Face Hub uploads.
Option to include or exclude system messages in the generated dataset.

Maintenance & Community

Developed by Stacklok Labs.
Contribution guidelines are available; open issues or pull requests are welcome.

Licensing & Compatibility

Licensed under the Apache 2.0 License.
Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

The quality of generated data is highly dependent on prompt quality and the chosen LLM. The library does not guarantee data quality and acknowledges that LLMs can produce unpredictable or inappropriate content, or fail to adhere to formatting instructions, though error handling for invalid JSON is included.

Health Check

Last Commit

3 days ago

Responsiveness

1+ week

Pull Requests (30d)

Issues (30d)

Star History

130 stars in the last 30 days