synthetic-data-kit by meta-llama

Synthetic data CLI tool for LLM fine-tuning

Created 4 months ago · 1,088 stars · Top 35.6% on sourcepulse

Project Summary

This toolkit simplifies the generation of high-quality synthetic datasets for fine-tuning Large Language Models (LLMs). It targets developers and researchers aiming to create task-specific datasets from various sources like PDFs, web pages, and YouTube videos, offering a structured workflow to convert raw data into fine-tuning-ready formats.

How It Works

The kit employs a modular, four-command CLI workflow:

  • ingest — parses diverse file formats into plain text.
  • create — generates synthetic data (QA pairs, Chain-of-Thought reasoning, summaries) using a local LLM served via vLLM.
  • curate — filters the generated data on quality, using the LLM as a judge.
  • save-as — exports data into fine-tuning formats (Alpaca, OpenAI, ChatML, etc.) or storage backends (HF Arrow).

Because both generation and curation run against local LLM inference, the workflow offers flexibility and control over dataset quality and format.
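For orientation, a minimal end-to-end run over a single PDF might look like the sketch below. The file paths are hypothetical and the flag spellings are inferred from this summary, not authoritative; check synthetic-data-kit --help for the current syntax.

    # Hypothetical pipeline over one document; paths and flags are illustrative
    synthetic-data-kit ingest docs/report.pdf
    synthetic-data-kit create data/output/report.txt --type qa
    synthetic-data-kit curate data/generated/report_qa_pairs.json
    synthetic-data-kit save-as data/cleaned/report_qa_pairs_cleaned.json -f alpaca

Each stage reads the previous stage's output from the data/ tree, which is why the directory layout described under Quick Start must exist before the first run.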

Quick Start & Requirements

  • Install: pip install synthetic-data-kit
  • Prerequisites: Python 3.10+, vLLM server running a compatible LLM (e.g., meta-llama/Llama-3.3-70B-Instruct via vllm serve ... --port 8000), and specific parsing dependencies (e.g., pdfminer.six, beautifulsoup4, pytube).
  • Setup: Requires the data/{pdf,html,youtube,docx,ppt,txt,output,generated,cleaned,final} directory layout and a reachable vLLM server (see the setup sketch after this list).
  • Docs: Guide on using the tool
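A minimal setup sketch combining the prerequisites above; the model is the example named in this summary, and any vLLM-compatible instruct model should work in its place:

    # Serve a local LLM for generation and curation (example model from above)
    vllm serve meta-llama/Llama-3.3-70B-Instruct --port 8000

    # Create the directory layout the tool expects
    mkdir -p data/{pdf,html,youtube,docx,ppt,txt,output,generated,cleaned,final}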

Highlighted Details

  • Supports ingestion from PDF, HTML, YouTube, DOCX, PPTX, and plain text.
  • Generates QA pairs and Chain-of-Thought (CoT) reasoning examples.
  • Includes a "curate" step using the LLM as a judge for quality filtering.
  • Outputs data in multiple fine-tuning formats, including Alpaca, OpenAI, and ChatML, with support for HF Arrow storage (an illustrative Alpaca-style record follows this list).
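For reference, an Alpaca-style record has roughly the following shape. The field names follow the Alpaca convention; the exact JSON emitted by save-as may differ, and the values here are invented for illustration.

    {
      "instruction": "What does the curate step do?",
      "input": "",
      "output": "It filters generated examples using the LLM as a quality judge."
    }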

Maintenance & Community

  • Developed by Meta AI.
  • Contribution guide available.

Licensing & Compatibility

  • License: Not explicitly stated in the README.
  • Compatibility: The exported formats (Alpaca, OpenAI, ChatML, HF Arrow) slot into standard LLM fine-tuning workflows.

Limitations & Caveats

The tool requires a running vLLM server, which can be resource-intensive for large models such as the 70B example above. While it supports various input formats, successful parsing depends on the quality of the input files and on having the right parsing dependencies installed. Finally, the curate step is only as reliable as the LLM doing the judging.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 9
  • Issues (30d): 14
  • Star History: 584 stars in the last 90 days
