synthetic-data-generator by argilla-io

Synthetic data generator for language models

Created 1 year ago
556 stars

Top 57.5% on SourcePulse

View on GitHub
Project Summary

This tool simplifies synthetic dataset generation for training language models, targeting AI developers and researchers. It leverages distilabel and LLMs to create tailored datasets for text classification, chat (SFT) data, and retrieval-augmented generation (RAG), accelerating AI development.

How It Works

The generator uses distilabel pipelines to interact with various LLM providers (Hugging Face, OpenAI, Ollama, vLLM) for data generation. Users define dataset characteristics and can iterate on samples, with options to push to Hugging Face Hub or Argilla for curation.
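
For orientation, the following is a minimal, hand-written sketch of the kind of distilabel pipeline the generator assembles: seed a prompt, generate completions with an LLM provider, and optionally push the result to the Hub. It is not the project's internal code; the model ID and repository name are placeholders.

    # Sketch of a distilabel-style pipeline (not the project's internal code).
    from distilabel.llms import InferenceEndpointsLLM
    from distilabel.pipeline import Pipeline
    from distilabel.steps import LoadDataFromDicts
    from distilabel.steps.tasks import TextGeneration

    with Pipeline(name="toy-synthetic-data") as pipeline:
        # Seed prompts describing the data to generate.
        seeds = LoadDataFromDicts(
            data=[{"instruction": "Write a customer support question about billing."}]
        )
        # Generate completions via Hugging Face Inference Endpoints (placeholder model).
        generate = TextGeneration(
            llm=InferenceEndpointsLLM(model_id="meta-llama/Llama-3.1-8B-Instruct")
        )
        seeds >> generate

    if __name__ == "__main__":
        distiset = pipeline.run()
        # Optionally push the generated rows to the Hub for curation (placeholder repo).
        distiset.push_to_hub("my-org/my-synthetic-dataset")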

Quick Start & Requirements

  • Install: pip install synthetic-dataset-generator
  • Run: from synthetic_dataset_generator import launch; launch()
  • Environment Variables: HF_TOKEN is required for Hugging Face Hub integration and free completions. Optional variables control generation parameters (MAX_NUM_TOKENS, MAX_NUM_ROWS, DEFAULT_BATCH_SIZE) and LLM provider configuration (MODEL, API_KEY, OPENAI_BASE_URL, etc.); see the launch sketch after this list.
  • Docs: https://github.com/argilla-io/synthetic-data-generator
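
A minimal launch sketch using the environment variables above; the token and limit values are illustrative placeholders:

    # Set the documented environment variables, then launch the generator UI.
    import os

    os.environ["HF_TOKEN"] = "hf_..."        # Hugging Face token (required)
    os.environ["MAX_NUM_TOKENS"] = "2048"    # generation-length limit (illustrative value)
    os.environ["MAX_NUM_ROWS"] = "1000"      # dataset row limit (illustrative value)
    os.environ["DEFAULT_BATCH_SIZE"] = "5"   # rows generated per batch (illustrative value)

    from synthetic_dataset_generator import launch

    launch()  # starts the local web UI for building and pushing datasets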

Highlighted Details

  • Supports multiple LLM providers and custom endpoints (see the endpoint sketch after this list).
  • Integrates with Argilla for data curation and Hugging Face Hub for dataset sharing.
  • Customizable generation parameters and prompt templates.
  • Docker setup available for Ollama and Argilla integration.
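
As a rough illustration of the custom-endpoint support, an OpenAI-compatible provider (such as a local Ollama or vLLM server) might be configured through the provider variables from the Quick Start list; the URL, key, and model name below are placeholders:

    # Sketch: point the generator at an OpenAI-compatible endpoint.
    # Variable names come from the Quick Start list; values are placeholders.
    import os

    os.environ["OPENAI_BASE_URL"] = "http://localhost:11434/v1"  # e.g. Ollama's OpenAI-compatible API
    os.environ["API_KEY"] = "not-needed-for-local"               # provider API key
    os.environ["MODEL"] = "llama3.1"                             # model served by the endpoint

    from synthetic_dataset_generator import launch

    launch()  # note: SFT/chat data generation is not supported with OpenAI endpoints (see Limitations)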

Maintenance & Community

  • The repository is pinned, suggesting active maintenance.
  • Links to examples and a development setup guide are provided.

Licensing & Compatibility

  • License: Apache-2.0
  • Compatible with commercial use.

Limitations & Caveats

SFT and chat data generation are not supported with OpenAI endpoints. Models that are not supported out of the box require specific prompt template configuration.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Tri Dao (Chief Scientist at Together AI), and 1 more.

hnet by goombalab

Top 0.3% on SourcePulse
800 stars
Hierarchical sequence modeling with dynamic chunking
Created 6 months ago
Updated 1 month ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 2 more.

Curator by NVIDIA-NeMo

Top 1.0% on SourcePulse
1k stars
Data curation toolkit for LLMs
Created 1 year ago
Updated 22 hours ago