synthetic-data-generator  by argilla-io

Synthetic data generator for language models

Created 10 months ago
528 stars


Project Summary

This tool generates synthetic datasets for training language models and is aimed at AI developers and researchers. It uses distilabel and LLMs to create tailored datasets for tasks such as text classification, chat, and RAG.

How It Works

The generator uses distilabel pipelines to interact with various LLM providers (Hugging Face, OpenAI, Ollama, vLLM) for data generation. Users define dataset characteristics and can iterate on samples, with options to push to Hugging Face Hub or Argilla for curation.
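The flow described above (define the dataset, generate in batches, inspect samples) can be sketched in plain Python. This is a minimal illustration only: it does not use the distilabel API, and `DatasetSpec`, `generate_rows`, and `echo_llm` are hypothetical names introduced here, with a stub function standing in for a real LLM provider.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical type: an "LLM" is any callable mapping a batch of prompts
# to a batch of completions. A real provider client would go here.
LLM = Callable[[List[str]], List[str]]


@dataclass
class DatasetSpec:
    """Hypothetical description of the dataset a user wants."""
    task: str          # e.g. "text classification"
    description: str   # free-text description of the data
    num_rows: int      # how many rows to generate


def generate_rows(spec: DatasetSpec, llm: LLM, batch_size: int = 5) -> List[Dict[str, str]]:
    """Build prompts from the spec and send them to the LLM batch by batch."""
    rows: List[Dict[str, str]] = []
    for start in range(0, spec.num_rows, batch_size):
        stop = min(start + batch_size, spec.num_rows)
        prompts = [
            f"Task: {spec.task}\nDataset: {spec.description}\nWrite example #{i + 1}."
            for i in range(start, stop)
        ]
        completions = llm(prompts)
        rows.extend({"prompt": p, "completion": c} for p, c in zip(prompts, completions))
    return rows


def echo_llm(prompts: List[str]) -> List[str]:
    """Stub provider: echoes the last line of each prompt instead of calling a model."""
    return [f"sample for: {p.splitlines()[-1]}" for p in prompts]
```

Calling `generate_rows(DatasetSpec("text classification", "customer reviews", 7), echo_llm, batch_size=3)` yields seven rows in three batches. A user could review a small sample like this before requesting a larger run.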

Quick Start & Requirements

  • Install: pip install synthetic-dataset-generator
  • Run: from synthetic_dataset_generator import launch; launch()
  • Environment Variables: HF_TOKEN is required for Hugging Face Hub integration and free completions. Optional variables control generation parameters (MAX_NUM_TOKENS, MAX_NUM_ROWS, DEFAULT_BATCH_SIZE) and LLM provider configurations (MODEL, API_KEY, OPENAI_BASE_URL, etc.).
  • Docs: https://github.com/argilla-io/synthetic-data-generator
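The install, run, and environment-variable steps above might be combined as follows. This is a sketch under stated assumptions: the variable names come from the list above, but the default values are purely illustrative, and `build_env` and `launch_app` are hypothetical helpers. It also assumes the package reads its configuration from the environment, so the variables are set before the import.

```python
import os
from typing import Dict


def build_env(
    hf_token: str,
    max_num_tokens: int = 2048,   # illustrative value, not a documented default
    max_num_rows: int = 1000,     # illustrative value, not a documented default
    default_batch_size: int = 5,  # illustrative value, not a documented default
) -> Dict[str, str]:
    """Collect the settings as strings, since environment variables are always text."""
    if not hf_token:
        raise ValueError("HF_TOKEN is required for Hugging Face Hub integration")
    return {
        "HF_TOKEN": hf_token,
        "MAX_NUM_TOKENS": str(max_num_tokens),
        "MAX_NUM_ROWS": str(max_num_rows),
        "DEFAULT_BATCH_SIZE": str(default_batch_size),
    }


def launch_app(env: Dict[str, str]) -> None:
    """Apply the settings, then start the app with the documented entry point."""
    os.environ.update(env)
    # Imported here so the variables are already set (assumption: the
    # package reads its configuration from the environment).
    from synthetic_dataset_generator import launch

    launch()
```

With the package installed, `launch_app(build_env(os.environ["HF_TOKEN"]))` would start the app using a token already exported in the shell.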

Highlighted Details

  • Supports multiple LLM providers and custom endpoints.
  • Integrates with Argilla for data curation and Hugging Face Hub for dataset sharing.
  • Customizable generation parameters and prompt templates.
  • Docker setup available for Ollama and Argilla integration.

Maintenance & Community

  • Project is pinned, which suggests maintainer attention but does not by itself confirm active maintenance.
  • Links to examples and development setup are provided.

Licensing & Compatibility

  • License: Apache-2.0
  • Compatible with commercial use.

Limitations & Caveats

SFT and chat data generation are not supported with OpenAI endpoints. Models that are not supported out of the box require specific prompt template configurations.
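A pre-flight check could catch the unsupported combination named above before any generation starts. The sketch below is a hypothetical guard, not part of the tool: the function name and the provider and task labels (`"openai"`, `"sft"`, `"chat"`) are assumptions chosen for illustration.

```python
# Hypothetical labels for the combinations the caveat above rules out.
UNSUPPORTED = {("openai", "sft"), ("openai", "chat")}


def check_supported(provider: str, task: str) -> None:
    """Raise early if the provider/task pair is listed as unsupported."""
    if (provider.lower(), task.lower()) in UNSUPPORTED:
        raise ValueError(
            f"{task} generation is not supported with {provider} endpoints; "
            "choose another provider for this task"
        )
```

Calling `check_supported("openai", "chat")` raises a `ValueError`, while `check_supported("ollama", "chat")` returns without error.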

Health Check

  • Last Commit: 4 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 13 stars in the last 30 days
