synthetic-data-kit by meta-llama

Synthetic data CLI tool for LLM fine-tuning

Created 4 months ago · 1,088 stars · Top 35.6% on sourcepulse

Project Summary

This toolkit simplifies the generation of high-quality synthetic datasets for fine-tuning Large Language Models (LLMs). It targets developers and researchers aiming to create task-specific datasets from various sources like PDFs, web pages, and YouTube videos, offering a structured workflow to convert raw data into fine-tuning-ready formats.

How It Works

The kit employs a modular, four-command CLI workflow:

  • ingest — parses diverse file formats into plain text.
  • create — generates synthetic data (QA pairs, Chain-of-Thought reasoning, summaries) using a local LLM served via vLLM.
  • curate — filters the generated data on quality, using the LLM as a judge.
  • save-as — exports data into fine-tuning formats (Alpaca, OpenAI, ChatML, etc.) or storage backends (HF Arrow).

Because both generation and curation run against local LLM inference, the workflow offers flexibility and control over dataset quality and format.
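For orientation, a minimal end-to-end run over a single PDF might look like the sketch below. The file paths are hypothetical and the flag spellings are inferred from this summary, not authoritative; check synthetic-data-kit --help for the current syntax.

    # Hypothetical pipeline over one document; paths and flags are illustrative
    synthetic-data-kit ingest docs/report.pdf
    synthetic-data-kit create data/output/report.txt --type qa
    synthetic-data-kit curate data/generated/report_qa_pairs.json
    synthetic-data-kit save-as data/cleaned/report_qa_pairs_cleaned.json -f alpaca

Each stage reads the previous stage's output from the data/ tree, which is why the directory layout described under Quick Start must exist before the first run.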

Quick Start & Requirements

  • Install: pip install synthetic-data-kit
  • Prerequisites: Python 3.10+, vLLM server running a compatible LLM (e.g., meta-llama/Llama-3.3-70B-Instruct via vllm serve ... --port 8000), and specific parsing dependencies (e.g., pdfminer.six, beautifulsoup4, pytube).
  • Setup: Requires the data/{pdf,html,youtube,docx,ppt,txt,output,generated,cleaned,final} directory layout and a reachable vLLM server (see the setup sketch after this list).
  • Docs: Guide on using the tool
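A minimal setup sketch combining the prerequisites above; the model is the example named in this summary, and any vLLM-compatible instruct model should work in its place:

    # Serve a local LLM for generation and curation (example model from above)
    vllm serve meta-llama/Llama-3.3-70B-Instruct --port 8000

    # Create the directory layout the tool expects
    mkdir -p data/{pdf,html,youtube,docx,ppt,txt,output,generated,cleaned,final}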

Highlighted Details

  • Supports ingestion from PDF, HTML, YouTube, DOCX, PPTX, and plain text.
  • Generates QA pairs and Chain-of-Thought (CoT) reasoning examples.
  • Includes a "curate" step using the LLM as a judge for quality filtering.
  • Outputs data in multiple fine-tuning formats, including Alpaca, OpenAI, and ChatML, with support for HF Arrow storage (an illustrative Alpaca-style record follows this list).
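For reference, an Alpaca-style record has roughly the following shape. The field names follow the Alpaca convention; the exact JSON emitted by save-as may differ, and the values here are invented for illustration.

    {
      "instruction": "What does the curate step do?",
      "input": "",
      "output": "It filters generated examples using the LLM as a quality judge."
    }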

Maintenance & Community

  • Developed by Meta AI.
  • Contribution guide available.

Licensing & Compatibility

  • License: Not explicitly stated in the README.
  • Compatibility: The exported formats (Alpaca, OpenAI, ChatML, HF Arrow) slot into standard LLM fine-tuning workflows.

Limitations & Caveats

The tool requires a running vLLM server, which can be resource-intensive for large models such as the 70B example above. While it supports various input formats, successful parsing depends on the quality of the input files and on having the right parsing dependencies installed. Finally, the curate step is only as reliable as the LLM doing the judging.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 9
  • Issues (30d): 14
  • Star History: 584 stars in the last 90 days
