Synthetic data CLI tool for LLM fine-tuning
This toolkit simplifies the generation of high-quality synthetic datasets for fine-tuning Large Language Models (LLMs). It targets developers and researchers aiming to create task-specific datasets from various sources like PDFs, web pages, and YouTube videos, offering a structured workflow to convert raw data into fine-tuning-ready formats.
How It Works
The kit employs a modular, four-command CLI workflow: ingest parses diverse file formats into plain text; create generates synthetic data (QA pairs, CoT reasoning, summaries) using a local LLM via vLLM; curate filters the generated data on quality, using the LLM as a judge; and save-as exports the data to various fine-tuning formats (Alpaca, OpenAI, ChatML, etc.) or storage backends (HF Arrow). This approach leverages local LLM inference for both generation and curation, offering flexibility and control over dataset quality and format.
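For concreteness, a minimal end-to-end run might look like the sketch below. The four subcommand names come from the workflow described above; the file paths, the --type value, and the --format value are illustrative assumptions, not verified defaults.

synthetic-data-kit ingest data/pdf/report.pdf                                  # parse a (hypothetical) PDF into plain text
synthetic-data-kit create data/output/report.txt --type qa                     # generate QA pairs via the local vLLM endpoint
synthetic-data-kit curate data/generated/report_qa_pairs.json                  # keep only pairs the LLM judge scores as high quality
synthetic-data-kit save-as data/cleaned/report_cleaned.json --format alpaca    # export in Alpaca fine-tuning format

In this sketch, each stage reads the previous stage's output from the data/ layout described under Quick Start & Requirements.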
Quick Start & Requirements
Install with pip install synthetic-data-kit. The tool expects a running vLLM server hosting a model such as meta-llama/Llama-3.3-70B-Instruct (started via vllm serve ... --port 8000), along with format-specific parsing dependencies (e.g., pdfminer.six, beautifulsoup4, pytube). Setup also involves creating the expected directory layout (data/{pdf,html,youtube,docx,ppt,txt,output,generated,cleaned,final}) and ensuring the vLLM server is accessible.
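A rough setup sketch, assuming a bash shell; whether the parsing libraries ship as optional extras or must be installed separately is an assumption here:

pip install synthetic-data-kit vllm pdfminer.six beautifulsoup4 pytube           # CLI, inference server, and the parsers named above
mkdir -p data/{pdf,html,youtube,docx,ppt,txt,output,generated,cleaned,final}     # working directories the kit expects
vllm serve meta-llama/Llama-3.3-70B-Instruct --port 8000                         # local endpoint used by create and curate

Once the server is listening on port 8000, the ingest, create, curate, and save-as commands can run against it.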
Highlighted Details
Maintenance & Community
Last recorded update: 1 week ago; activity status listed as Inactive.
Licensing & Compatibility
Limitations & Caveats
The tool requires a running vLLM server, and hosting a large instruct model such as Llama-3.3-70B locally can be resource-intensive. While it supports a range of input formats, successful parsing depends on the quality of the input files and on which parsing dependencies are installed. The effectiveness of curate is bounded by the judging ability of the serving LLM.