Synthetic data curation tool for post-training and structured data extraction
Top 28.6% on sourcepulse
Bespoke Curator is a Python library for generating and curating synthetic data, primarily for LLM post-training and structured data extraction. It is aimed at researchers and developers who build and fine-tune large language models.
How It Works
Curator uses a pipeline approach: users define each data generation step as a custom curator.LLM class and describe structured outputs with Pydantic models. It connects to LLM backends through LiteLLM and vLLM, and supports asynchronous operations, caching, and fault recovery for scalable data generation. The library emphasizes structured output parsing and chaining LLM calls to build complex data pipelines.
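The pattern described above (a step that builds a prompt, calls a model, parses the reply into typed records, and feeds the next step) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Curator's actual API: CachedStep, TopicStep, PoemStep, and fake_llm are hypothetical names, standard-library dataclasses stand in for Pydantic models, and the "model" is a local stub that returns canned JSON so the example runs offline.

```python
import asyncio
import hashlib
import json
from dataclasses import dataclass


@dataclass
class Topic:  # stand-in for a Pydantic model describing structured output
    name: str


@dataclass
class Poem:
    topic: str
    text: str


async def fake_llm(prompt: str) -> str:
    """Hypothetical stub replacing a real provider call; returns JSON text."""
    if prompt.startswith("List topics"):
        return json.dumps({"topics": ["sea", "moon"]})
    topic = prompt.rsplit(" ", 1)[-1]  # last word of the prompt
    return json.dumps({"text": f"An ode to the {topic}"})


class CachedStep:
    """Hypothetical pipeline step: prompt -> (cached) model call -> parse."""

    def __init__(self, llm):
        self.llm = llm
        self.cache = {}   # prompt hash -> raw response text
        self.calls = 0    # number of real (uncached) model calls

    def prompt(self, row):
        raise NotImplementedError

    def parse(self, row, response):
        raise NotImplementedError

    async def _one(self, row):
        prompt = self.prompt(row)
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key not in self.cache:  # only call the model on a cache miss
            self.calls += 1
            self.cache[key] = await self.llm(prompt)
        return self.parse(row, json.loads(self.cache[key]))

    async def run(self, rows):
        # Issue all requests concurrently, then flatten the parsed lists.
        batches = await asyncio.gather(*(self._one(r) for r in rows))
        return [item for batch in batches for item in batch]


class TopicStep(CachedStep):
    def prompt(self, row):
        return "List topics for poems"

    def parse(self, row, response):
        return [Topic(name=t) for t in response["topics"]]


class PoemStep(CachedStep):
    def prompt(self, row):
        return f"Write a poem about {row.name}"

    def parse(self, row, response):
        return [Poem(topic=row.name, text=response["text"])]


async def main():
    topics = await TopicStep(fake_llm).run([None])  # first step has no input rows
    poet = PoemStep(fake_llm)
    poems = await poet.run(topics)                  # chained: topics feed poems
    await poet.run(topics)                          # second run hits the cache
    return topics, poems, poet.calls


topics, poems, calls = asyncio.run(main())
```

Keying the cache on the prompt text means a repeated run makes no new model calls. The same idea lets an interrupted job resume without repeating finished requests, although this sketch keeps the cache in memory only.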
Quick Start & Requirements
pip install bespokelabs-curator
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats