curator by bespokelabsai

Synthetic data curation tool for post-training and structured data extraction

created 9 months ago
1,465 stars

Top 28.6% on sourcepulse

Project Summary

Bespoke Curator is a Python library for generating and curating synthetic data, primarily for LLM post-training and structured data extraction. It is aimed at researchers and developers who build and fine-tune large language models.

How It Works

Curator pipelines are written in Python: users define each data generation step as a custom curator.LLM class and describe its structured output with a Pydantic model. The library integrates with LLM backends through LiteLLM and vLLM, and supports asynchronous operation, caching, and fault recovery for scalable data generation. It emphasizes parsing responses into structured output and chaining LLM calls to build multi-step data pipelines.
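
The step-and-chain pattern described above can be sketched without the library itself. The following is a minimal illustration using only the standard library, not Curator's actual API: SketchLLM, TopicGenerator, PoemWriter, and the two fake_* completion functions are hypothetical names, and a dataclass stands in for the Pydantic model. Each step builds a prompt from an input row, parses the response into structured rows, and passes those rows to the next step.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable


@dataclass
class Topic:
    # Stand-in for a Pydantic model that describes the structured output.
    name: str


class SketchLLM:
    """Hypothetical stand-in for a curator.LLM-style step (not the real API)."""

    def __init__(self, complete: Callable[[str], str]):
        # Any function that maps prompt text to response text.
        self.complete = complete

    def prompt(self, row: dict) -> str:
        raise NotImplementedError

    def parse(self, row: dict, response: str) -> list[dict]:
        raise NotImplementedError

    def __call__(self, rows: list[dict]) -> list[dict]:
        # Run every input row through prompt -> completion -> parse.
        out = []
        for row in rows:
            out.extend(self.parse(row, self.complete(self.prompt(row))))
        return out


class TopicGenerator(SketchLLM):
    def prompt(self, row: dict) -> str:
        return f"List two poem topics about {row['theme']} as JSON."

    def parse(self, row: dict, response: str) -> list[dict]:
        # Topic(**item) rejects unexpected fields, a crude form of validation.
        return [asdict(Topic(**item)) for item in json.loads(response)]


class PoemWriter(SketchLLM):
    def prompt(self, row: dict) -> str:
        return f"Write a poem about {row['name']}."

    def parse(self, row: dict, response: str) -> list[dict]:
        return [{"topic": row["name"], "poem": response}]


def fake_topics(prompt: str) -> str:
    # Canned response so the sketch runs without a provider or API key.
    return json.dumps([{"name": "rivers"}, {"name": "cities"}])


def fake_poem(prompt: str) -> str:
    return f"An ode written for: {prompt}"


# Chain the two steps: the parsed output of the first feeds the second.
topics = TopicGenerator(fake_topics)([{"theme": "nature"}])
poems = PoemWriter(fake_poem)(topics)
```

In a real pipeline the fake completion functions would be replaced by calls to a model backend, and the official documentation should be consulted for the actual class and method signatures.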

Quick Start & Requirements

  • Install via pip: pip install bespokelabs-curator
  • Requires API keys for LLM providers (e.g., OpenAI, Anthropic, Gemini) set as environment variables.
  • Code execution can run locally or on backends such as Ray, Docker, and e2b.
  • Official documentation: https://docs.bespokelabs.ai/bespoke-curator/getting-started
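
Since provider API keys are read from environment variables, it can help to check for them before starting a long run. The sketch below is one possible pre-flight check; the variable names are assumptions based on common provider conventions (confirm them against the official documentation), and missing_keys is a hypothetical helper, not part of the library.

```python
import os

# Assumed environment variable names, following common provider conventions.
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
}


def missing_keys(providers, env=None):
    """Return the variable names that are unset or empty for the given providers."""
    env = os.environ if env is None else env
    return [PROVIDER_KEYS[p] for p in providers if not env.get(PROVIDER_KEYS[p])]
```

For example, missing_keys(["openai", "gemini"]) returns the names that still need to be exported in the current shell; an empty list means every requested key is present.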

Highlighted Details

  • Supports batch processing for significant cost savings (up to 50%) with providers like OpenAI and Anthropic.
  • Includes a viewer for real-time data visualization during generation.
  • Offers built-in caching and retries for robust pipeline execution.
  • Supports executing generated code and generating multimodal data.
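
To make the caching and retry behavior concrete, here is a small standard-library sketch of the general technique. It is not Curator's implementation: CachedRetryingClient and flaky_backend are hypothetical names, the cache is an in-memory dict keyed by a hash of the request, and failed calls are retried with exponential backoff.

```python
import hashlib
import json
import time


class CachedRetryingClient:
    """Illustrative wrapper: cache responses by request hash, retry on failure."""

    def __init__(self, call, max_retries=3, base_delay=0.5):
        self.call = call                # function (model, prompt) -> response text
        self.max_retries = max_retries  # retries allowed after the first attempt
        self.base_delay = base_delay    # seconds; doubled after each failure
        self.cache = {}

    def _key(self, model, prompt):
        # Stable key: identical requests always hash to the same value.
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def complete(self, model, prompt):
        key = self._key(model, prompt)
        if key in self.cache:
            return self.cache[key]  # cache hit: the backend is not called
        for attempt in range(self.max_retries + 1):
            try:
                result = self.call(model, prompt)
                break
            except Exception:
                if attempt == self.max_retries:
                    raise  # out of retries: surface the last error
                time.sleep(self.base_delay * (2 ** attempt))
        self.cache[key] = result
        return result


# Demo backend that fails twice before succeeding.
attempts = {"count": 0}


def flaky_backend(model, prompt):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient failure")
    return f"{model} says hi to {prompt}"


client = CachedRetryingClient(flaky_backend, max_retries=3, base_delay=0.0)
first = client.complete("demo-model", "world")   # succeeds on the third attempt
second = client.complete("demo-model", "world")  # served from the cache
```

A persistent cache (for example, on disk) would additionally let an interrupted run resume without repeating completed requests, which is the kind of fault recovery the summary describes.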

Maintenance & Community

  • Active development with recent updates including Gemini batch support and Claude 3.7 Sonnet integration.
  • Community support via Discord: https://discord.gg/KqpXvpzVBS
  • Active presence on Hugging Face and X (Twitter).

Licensing & Compatibility

  • The repository does not explicitly state a license in the provided README.

Limitations & Caveats

  • The README mentions intermittent issues with the DeepSeek API, recommending specific configurations for reliability.
  • Anonymized telemetry is collected by default, though opt-out is available.

Health Check

  • Last commit: 4 days ago
  • Responsiveness: 1 day
  • Pull requests (30d): 1
  • Issues (30d): 3

Star History

  • 187 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Daniel Han (Cofounder of Unsloth), and 1 more.

synthetic-data-kit by meta-llama

1.6%
1k
Synthetic data CLI tool for LLM fine-tuning
created 4 months ago
updated 1 week ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Daniel Han (Cofounder of Unsloth).

Kiln by Kiln-AI

0.6%
4k
AI prototyping and dataset collaboration tool
created 1 year ago
updated 22 hours ago
Starred by Omar Sanseviero (DevRel at Google DeepMind), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 4 more.

argilla by argilla-io

0.4%
5k
Collaboration tool for building high-quality AI datasets
created 4 years ago
updated 5 days ago