curator by bespokelabsai

Synthetic data curation tool for post-training and structured data extraction

Created 1 year ago

1,594 stars

Top 26.1% on SourcePulse


5 Experts Love This Project

Ross Taylor

Cofounder of General Reasoning; Cocreator of Papers with Code

Yaowei Zheng

Author of LLaMA-Factory

Wing Lian

Founder of Axolotl AI

Jeffrey Morgan

Cofounder of Ollama

and 1 more!

Project Summary

Bespoke Curator is a Python library for generating and curating synthetic data, primarily for LLM post-training and structured data extraction. It targets researchers and developers who build and fine-tune large language models.

How It Works

Curator structures data generation as a Python pipeline: users define each generation step as a custom curator.LLM class and use Pydantic models for structured outputs. It integrates with LLM providers through LiteLLM and vLLM, and supports asynchronous operation, caching, and fault recovery so that generation jobs can scale. The library emphasizes structured output parsing and chaining LLM calls into multi-step data pipelines.
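The step-and-chain idea described above can be sketched in plain Python. This is only an illustration of the pattern, not Curator's API: the Stage base class, its prompt and parse methods, the Question dataclass (standing in for a Pydantic model), and the fake_model function are all hypothetical names, and the model call is stubbed so the sketch runs offline.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; returns canned JSON."""
    if prompt.startswith("List topics"):
        return json.dumps({"topics": ["sorting", "hashing"]})
    topic = prompt.rsplit(": ", 1)[-1]
    return json.dumps({"topic": topic, "question": f"What is {topic}?"})


@dataclass
class Question:
    """Structured output of the second stage (assumed schema)."""
    topic: str
    question: str


class Stage:
    """One generation step: build a prompt per row, then parse the response."""

    def __init__(self, model: Callable[[str], str]):
        self.model = model

    def prompt(self, row: Dict) -> str:
        raise NotImplementedError

    def parse(self, row: Dict, response: str) -> List[Dict]:
        raise NotImplementedError

    def __call__(self, rows: List[Dict]) -> List[Dict]:
        out: List[Dict] = []
        for row in rows:
            out.extend(self.parse(row, self.model(self.prompt(row))))
        return out


class TopicStage(Stage):
    def prompt(self, row: Dict) -> str:
        return f"List topics about {row['subject']}"

    def parse(self, row: Dict, response: str) -> List[Dict]:
        # One input row fans out into one output row per topic.
        return [{"topic": t} for t in json.loads(response)["topics"]]


class QuestionStage(Stage):
    def prompt(self, row: Dict) -> str:
        return f"Write one question about: {row['topic']}"

    def parse(self, row: Dict, response: str) -> List[Dict]:
        # Building the dataclass fails loudly if a field is missing.
        q = Question(**json.loads(response))
        return [{"topic": q.topic, "question": q.question}]


# Chain the stages: the output rows of one become the input rows of the next.
topics = TopicStage(fake_model)([{"subject": "algorithms"}])
questions = QuestionStage(fake_model)(topics)
```

Splitting each step into a prompt builder and a response parser keeps the stages independent, so any stage's output can feed the next one. The official documentation linked under Quick Start & Requirements describes the library's actual classes and signatures.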

Quick Start & Requirements

Install via pip: pip install bespokelabs-curator
Requires API keys for LLM providers (e.g., OpenAI, Anthropic, Gemini) set as environment variables.
Code execution can run locally or on backends such as Ray, Docker, and e2b.
Official documentation: https://docs.bespokelabs.ai/bespoke-curator/getting-started
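Because provider keys are read from environment variables, a small pre-flight check can catch a missing key before a long generation job starts. The sketch below is an assumption-laden helper, not part of Curator: the missing_keys function and the PROVIDER_KEYS mapping are hypothetical, and the variable names are the ones conventionally used by these providers' tooling, so they should be confirmed against the documentation for the backend in use.

```python
import os
from typing import Dict, List, Mapping

# Conventional variable names (an assumption; verify for your backend).
PROVIDER_KEYS: Dict[str, str] = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
}


def missing_keys(
    providers: List[str], env: Mapping[str, str] = os.environ
) -> List[str]:
    """Return the variable names that are unset or empty for these providers."""
    return [PROVIDER_KEYS[p] for p in providers if not env.get(PROVIDER_KEYS[p])]


# Example with an explicit mapping instead of the real environment.
example_env = {"OPENAI_API_KEY": "sk-example"}
example_missing = missing_keys(["openai", "gemini"], example_env)
```

Passing the environment as a parameter keeps the check easy to test without touching real credentials.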

Highlighted Details

Supports batch processing, which can cut costs by up to 50% with providers such as OpenAI and Anthropic.
Includes a viewer for real-time data visualization during generation.
Offers built-in caching and retries for robust pipeline execution.
Supports executing generated code and generating multimodal data.
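To make the caching and retry behavior listed above concrete, here is a minimal sketch of the general technique, assuming a content-hash cache key and exponential backoff. It does not reflect how Curator implements either feature: CachedRetryingClient, its parameters, and the flaky stub are hypothetical.

```python
import hashlib
import json
import time
from typing import Callable, Dict


class CachedRetryingClient:
    """Wraps a model call with an in-memory cache and bounded retries."""

    def __init__(
        self,
        call: Callable[[str, str], str],
        max_retries: int = 3,
        backoff_s: float = 0.0,
    ):
        self.call = call
        self.max_retries = max_retries
        self.backoff_s = backoff_s
        self.cache: Dict[str, str] = {}

    def _key(self, model: str, prompt: str) -> str:
        # Hash the full request so identical requests share one cache entry.
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def generate(self, model: str, prompt: str) -> str:
        key = self._key(model, prompt)
        if key in self.cache:
            return self.cache[key]
        last_error = None
        for attempt in range(self.max_retries):
            try:
                result = self.call(model, prompt)
                self.cache[key] = result
                return result
            except RuntimeError as err:
                last_error = err
                # Exponential backoff: backoff_s, 2*backoff_s, 4*backoff_s, ...
                time.sleep(self.backoff_s * (2 ** attempt))
        raise RuntimeError(
            f"gave up after {self.max_retries} attempts"
        ) from last_error


# Stub that fails once, then succeeds, to exercise the retry path.
calls = {"n": 0}


def flaky(model: str, prompt: str) -> str:
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient failure")
    return f"{model}:{prompt.upper()}"


client = CachedRetryingClient(flaky)
first = client.generate("demo-model", "hello")   # retried once, then cached
second = client.generate("demo-model", "hello")  # served from the cache
```

A persistent cache (on disk rather than in memory) is what lets an interrupted run resume without paying again for completed requests; the in-memory dictionary here only keeps the sketch short.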

Maintenance & Community

Active development with recent updates including Gemini batch support and Claude 3.7 Sonnet integration.
Community support via Discord: https://discord.gg/KqpXvpzVBS
Active presence on Hugging Face and X (Twitter).

Licensing & Compatibility

The repository does not explicitly state a license in the provided README.

Limitations & Caveats

The README mentions intermittent issues with the DeepSeek API, recommending specific configurations for reliability.
Anonymized telemetry is collected by default, though opt-out is available.

Health Check

Last Commit

6 days ago

Responsiveness

1 day

Star History

23 stars in the last 30 days