Python library for synthetic dataset generation using LLMs
Top 69.2% on sourcepulse
Promptwright is a Python library for generating large synthetic datasets using Large Language Models (LLMs), supporting both local LLMs (via Ollama, VLLM) and major cloud providers (OpenAI, Anthropic, etc.). It offers flexible configuration via YAML or Python code, enabling users to define complex data generation tasks with customizable prompts, instructions, and model parameters, ultimately facilitating the creation of high-quality training data for AI models.
How It Works
Promptwright leverages LiteLLM for broad LLM provider compatibility, allowing seamless switching between different models and APIs. It supports structured data generation through a "topic tree" approach, where prompts branch out to create hierarchical datasets, and a "data engine" for more direct instruction-based generation. Tasks are defined in YAML or programmatically, specifying model details, prompts, and output formats, with options to include system messages and push results directly to Hugging Face Hub.
Quick Start & Requirements
pip install promptwright
git clone https://github.com/StacklokLabs/promptwright.git && cd promptwright && poetry install
promptwright start config.yaml
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The quality of generated data is highly dependent on prompt quality and the chosen LLM. The library does not guarantee data quality and acknowledges that LLMs can produce unpredictable or inappropriate content, or fail to adhere to formatting instructions, though error handling for invalid JSON is included.
2 days ago
Inactive