promptwright  by StacklokLabs

Python library for synthetic dataset generation using LLMs

created 9 months ago
438 stars

Top 69.2% on sourcepulse

GitHubView on GitHub
Project Summary

Promptwright is a Python library for generating large synthetic datasets using Large Language Models (LLMs), supporting both local LLMs (via Ollama, VLLM) and major cloud providers (OpenAI, Anthropic, etc.). It offers flexible configuration via YAML or Python code, enabling users to define complex data generation tasks with customizable prompts, instructions, and model parameters, ultimately facilitating the creation of high-quality training data for AI models.

How It Works

Promptwright leverages LiteLLM for broad LLM provider compatibility, allowing seamless switching between different models and APIs. It supports structured data generation through a "topic tree" approach, where prompts branch out to create hierarchical datasets, and a "data engine" for more direct instruction-based generation. Tasks are defined in YAML or programmatically, specifying model details, prompts, and output formats, with options to include system messages and push results directly to Hugging Face Hub.

Quick Start & Requirements

  • Install via pip: pip install promptwright
  • Development install: git clone https://github.com/StacklokLabs/promptwright.git && cd promptwright && poetry install
  • Prerequisites: Python 3.11+, Poetry (for development).
  • Optional: Hugging Face account and API token for uploads.
  • Usage: promptwright start config.yaml
  • Documentation: https://github.com/StacklokLabs/promptwright

Highlighted Details

  • Supports multiple LLM providers and local models via LiteLLM.
  • YAML configuration for defining generation tasks and Hugging Face Hub integration.
  • Command-line interface for running tasks and overriding configuration.
  • Automatic dataset card creation and tagging for Hugging Face Hub uploads.
  • Option to include or exclude system messages in the generated dataset.

Maintenance & Community

  • Developed by Stacklok Labs.
  • Contribution guidelines are available; open issues or pull requests are welcome.

Licensing & Compatibility

  • Licensed under the Apache 2.0 License.
  • Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

The quality of generated data is highly dependent on prompt quality and the chosen LLM. The library does not guarantee data quality and acknowledges that LLMs can produce unpredictable or inappropriate content, or fail to adhere to formatting instructions, though error handling for invalid JSON is included.

Health Check
Last commit

2 days ago

Responsiveness

Inactive

Pull Requests (30d)
23
Issues (30d)
0
Star History
30 stars in the last 90 days

Explore Similar Projects

Starred by Jared Palmer Jared Palmer(Ex-VP of AI at Vercel; Founder of Turborepo; Author of Formik, TSDX), Jeff Hammerbacher Jeff Hammerbacher(Cofounder of Cloudera), and
1 more.

promptable by cfortuner

0%
2k
TS/JS library for building full-stack AI apps
created 2 years ago
updated 2 years ago
Starred by Omar Sanseviero Omar Sanseviero(DevRel at Google DeepMind), Chip Huyen Chip Huyen(Author of AI Engineering, Designing Machine Learning Systems), and
4 more.

argilla by argilla-io

0.4%
5k
Collaboration tool for building high-quality AI datasets
created 4 years ago
updated 4 days ago
Feedback? Help us improve.