Synthetic data generator for language models
This tool simplifies synthetic dataset generation for training language models, targeting AI developers and researchers. It leverages distilabel and LLMs to build tailored datasets for tasks such as text classification, chat (SFT) data, and RAG, accelerating AI development.
How It Works
The generator uses distilabel pipelines to drive various LLM providers (Hugging Face, OpenAI, Ollama, vLLM) for data generation. Users define the dataset's characteristics, iterate on generated samples, and can push the result to the Hugging Face Hub or to Argilla for curation.
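For context, here is a minimal sketch of the kind of distilabel pipeline involved. The model id, seed prompt, and step wiring are illustrative assumptions for this sketch, not the generator's exact internals:

```python
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

# Illustrative pipeline: seed instructions in, generated completions out.
with Pipeline(name="toy-synthetic-data") as pipeline:
    seeds = LoadDataFromDicts(
        data=[{"instruction": "Write a short product review for a coffee grinder."}]
    )
    generate = TextGeneration(
        llm=InferenceEndpointsLLM(  # swap in OpenAILLM, OllamaLLM, vLLM, etc.
            model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",  # assumed model
        )
    )
    seeds >> generate  # connect the steps

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False)
    distiset.push_to_hub("your-username/toy-synthetic-dataset")  # optional
```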
Quick Start & Requirements
```
pip install synthetic-dataset-generator
```

```python
from synthetic_dataset_generator import launch

launch()
```
`HF_TOKEN` is required for Hugging Face Hub integration and free completions. Optional environment variables control generation parameters (`MAX_NUM_TOKENS`, `MAX_NUM_ROWS`, `DEFAULT_BATCH_SIZE`) and LLM provider configuration (`MODEL`, `API_KEY`, `OPENAI_BASE_URL`, etc.).
Limitations & Caveats
SFT and chat data generation are not supported with OpenAI endpoints. Models that are not supported out of the box require a specific prompt template configuration.
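For models that need a custom template, the upstream docs describe a pre-query template setting; a sketch, assuming the `MAGPIE_PRE_QUERY_TEMPLATE` environment variable and treating the value as model-specific:

```python
import os

# Assumption: variable name follows the upstream docs; the value is
# model-specific and illustrative (an alias like "llama3" or a custom
# template string).
os.environ["MAGPIE_PRE_QUERY_TEMPLATE"] = "llama3"

from synthetic_dataset_generator import launch

launch()
```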