synthetic-data-generator  by argilla-io

Synthetic data generator for language models

created 8 months ago
505 stars

Top 62.5% on sourcepulse

Project Summary

This tool simplifies synthetic dataset generation for training language models and targets AI developers and researchers. It uses distilabel and LLMs to create tailored datasets for tasks such as text classification, chat data, and RAG.

How It Works

The generator uses distilabel pipelines to interact with various LLM providers (Hugging Face, OpenAI, Ollama, vLLM) for data generation. Users define dataset characteristics and can iterate on samples, with options to push to Hugging Face Hub or Argilla for curation.

Quick Start & Requirements

  • Install: pip install synthetic-dataset-generator
  • Run: from synthetic_dataset_generator import launch; launch()
  • Environment Variables: HF_TOKEN is required for Hugging Face Hub integration and for free completions. Optional variables control generation parameters (MAX_NUM_TOKENS, MAX_NUM_ROWS, DEFAULT_BATCH_SIZE) and LLM provider configuration (MODEL, API_KEY, OPENAI_BASE_URL, etc.).
  • Docs: https://github.com/argilla-io/synthetic-data-generator
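
The quick-start steps above can be sketched in Python as follows. This is a minimal illustration, not the project's own setup code: the environment variable names and the launch import come from the list above, while the helper names configure_generator and launch_app, the token value, and the idea of configuring before import are assumptions made for this example.

```python
import os


def configure_generator(hf_token, *, max_num_tokens=None, max_num_rows=None,
                        default_batch_size=None, env=None):
    """Write generator settings into an environment mapping and return it.

    Hypothetical helper for illustration. Optional settings left as None
    are not written, so the tool's own defaults stay in effect.
    """
    target = os.environ if env is None else env
    target["HF_TOKEN"] = hf_token
    optional = {
        "MAX_NUM_TOKENS": max_num_tokens,
        "MAX_NUM_ROWS": max_num_rows,
        "DEFAULT_BATCH_SIZE": default_batch_size,
    }
    for name, value in optional.items():
        if value is not None:
            target[name] = str(value)  # environment values must be strings
    return target


def launch_app():
    """Start the app; requires `pip install synthetic-dataset-generator`."""
    # Imported here, after configuration, on the assumption that the
    # package reads its settings from the environment when it loads.
    from synthetic_dataset_generator import launch
    launch()
```

A typical session would call configure_generator("hf_placeholder", max_num_rows=100) with a real token in place of the placeholder, then launch_app(). Passing a plain dict as env lets the configuration be inspected without touching the real process environment.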

Highlighted Details

  • Supports multiple LLM providers and custom endpoints.
  • Integrates with Argilla for data curation and Hugging Face Hub for dataset sharing.
  • Customizable generation parameters and prompt templates.
  • Docker setup available for Ollama and Argilla integration.

Maintenance & Community

  • Project is pinned, which suggests active maintenance.
  • Links to examples and development setup are provided.

Licensing & Compatibility

  • License: Apache-2.0
  • Compatible with commercial use.

Limitations & Caveats

SFT and chat data generation is not supported with OpenAI endpoints. Models that are not supported out of the box require specific prompt template configuration.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

  • 40 stars in the last 90 days
