synthetic-data-kit  by meta-llama

Synthetic data CLI tool for LLM fine-tuning

Created 9 months ago
1,456 stars

Top 27.9% on SourcePulse

GitHubView on GitHub
Project Summary

This toolkit simplifies the generation of high-quality synthetic datasets for fine-tuning Large Language Models (LLMs). It targets developers and researchers aiming to create task-specific datasets from various sources like PDFs, web pages, and YouTube videos, offering a structured workflow to convert raw data into fine-tuning-ready formats.

How It Works

The kit employs a modular, four-command CLI workflow: ingest to parse diverse file formats into text, create to generate synthetic data (QA pairs, CoT reasoning, summaries) using a local LLM via vLLM, curate to filter generated data based on quality using the LLM as a judge, and save-as to export data into various fine-tuning formats (Alpaca, OpenAI, ChatML, etc.) or storage backends (HF Arrow). This approach leverages local LLM inference for data generation and curation, offering flexibility and control over dataset quality and format.

Quick Start & Requirements

  • Install: pip install synthetic-data-kit
  • Prerequisites: Python 3.10+, vLLM server running a compatible LLM (e.g., meta-llama/Llama-3.3-70B-Instruct via vllm serve ... --port 8000), and specific parsing dependencies (e.g., pdfminer.six, beautifulsoup4, pytube).
  • Setup: Requires creating a specific directory structure (data/{pdf,html,youtube,docx,ppt,txt,output,generated,cleaned,final}) and ensuring the vLLM server is accessible.
  • Docs: Guide on using the tool

Highlighted Details

  • Supports ingestion from PDF, HTML, YouTube, DOCX, PPTX, and plain text.
  • Generates QA pairs and Chain-of-Thought (CoT) reasoning examples.
  • Includes a "curate" step using the LLM as a judge for quality filtering.
  • Outputs data in multiple fine-tuning formats including Alpaca, OpenAI, and ChatML, with support for HF Arrow storage.

Maintenance & Community

  • Developed by Meta AI.
  • Contribution guide available.

Licensing & Compatibility

  • License: Not explicitly stated in the README.
  • Compatibility: Designed for fine-tuning LLMs, compatible with standard fine-tuning workflows.

Limitations & Caveats

The tool requires a running vLLM server with a specific model, which can be resource-intensive. While it supports various input formats, successful parsing depends on the quality of the input files and installed parsing dependencies. The "curate" functionality's effectiveness is tied to the LLM's judging capabilities.

Health Check
Last Commit

2 months ago

Responsiveness

1 day

Pull Requests (30d)
1
Issues (30d)
1
Star History
38 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao Shizhe Diao(Author of LMFlow; Research Scientist at NVIDIA), Maxime Labonne Maxime Labonne(Head of Post-Training at Liquid AI), and
2 more.

cosmopedia by huggingface

0%
557
Synthetic dataset for LLM training
Created 1 year ago
Updated 1 year ago
Starred by Ross Taylor Ross Taylor(Cofounder of General Reasoning; Cocreator of Papers with Code), Yaowei Zheng Yaowei Zheng(Author of LLaMA-Factory), and
3 more.

curator by bespokelabsai

0.2%
2k
Synthetic data curation tool for post-training and structured data extraction
Created 1 year ago
Updated 6 days ago
Starred by Chip Huyen Chip Huyen(Author of "AI Engineering", "Designing Machine Learning Systems"), Didier Lopes Didier Lopes(Founder of OpenBB), and
3 more.

instructlab by instructlab

0.3%
1k
CLI tool for LLM alignment tuning via synthetic data
Created 1 year ago
Updated 6 days ago
Starred by Alex Atallah Alex Atallah(Cofounder of OpenRouter, OpenSea), Shyamal Anadkat Shyamal Anadkat(Research Scientist at OpenAI), and
1 more.

gpt-llm-trainer by mshumer

0.1%
4k
LLM fine-tuning pipeline
Created 2 years ago
Updated 8 months ago
Starred by Chip Huyen Chip Huyen(Author of "AI Engineering", "Designing Machine Learning Systems") and Yaowei Zheng Yaowei Zheng(Author of LLaMA-Factory).

easy-dataset by ConardLi

0.6%
13k
Dataset tool for LLM fine-tuning
Created 10 months ago
Updated 1 day ago
Feedback? Help us improve.