synthetic-data-generator by argilla-io

Synthetic data generator for language models

Created 1 year ago
556 stars

Top 57.5% on SourcePulse

View on GitHub
Project Summary

This tool simplifies synthetic dataset generation for training language models, targeting AI developers and researchers. It leverages distilabel and LLMs to create tailored datasets for text classification, chat (SFT) data, and retrieval-augmented generation (RAG), accelerating AI development.

How It Works

The generator uses distilabel pipelines to interact with various LLM providers (Hugging Face, OpenAI, Ollama, vLLM) for data generation. Users define dataset characteristics and can iterate on samples, with options to push to Hugging Face Hub or Argilla for curation.
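
For orientation, the following is a minimal, hand-written sketch of the kind of distilabel pipeline the generator assembles: seed a prompt, generate completions with an LLM provider, and optionally push the result to the Hub. It is not the project's internal code; the model ID and repository name are placeholders.

    # Sketch of a distilabel-style pipeline (not the project's internal code).
    from distilabel.llms import InferenceEndpointsLLM
    from distilabel.pipeline import Pipeline
    from distilabel.steps import LoadDataFromDicts
    from distilabel.steps.tasks import TextGeneration

    with Pipeline(name="toy-synthetic-data") as pipeline:
        # Seed prompts describing the data to generate.
        seeds = LoadDataFromDicts(
            data=[{"instruction": "Write a customer support question about billing."}]
        )
        # Generate completions via Hugging Face Inference Endpoints (placeholder model).
        generate = TextGeneration(
            llm=InferenceEndpointsLLM(model_id="meta-llama/Llama-3.1-8B-Instruct")
        )
        seeds >> generate

    if __name__ == "__main__":
        distiset = pipeline.run()
        # Optionally push the generated rows to the Hub for curation (placeholder repo).
        distiset.push_to_hub("my-org/my-synthetic-dataset")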

Quick Start & Requirements

  • Install: pip install synthetic-dataset-generator
  • Run: from synthetic_dataset_generator import launch; launch()
  • Environment Variables: HF_TOKEN is required for Hugging Face Hub integration and free completions. Optional variables control generation parameters (MAX_NUM_TOKENS, MAX_NUM_ROWS, DEFAULT_BATCH_SIZE) and LLM provider configuration (MODEL, API_KEY, OPENAI_BASE_URL, etc.); see the launch sketch after this list.
  • Docs: https://github.com/argilla-io/synthetic-data-generator
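
A minimal launch sketch using the environment variables above; the token and limit values are illustrative placeholders:

    # Set the documented environment variables, then launch the generator UI.
    import os

    os.environ["HF_TOKEN"] = "hf_..."        # Hugging Face token (required)
    os.environ["MAX_NUM_TOKENS"] = "2048"    # generation-length limit (illustrative value)
    os.environ["MAX_NUM_ROWS"] = "1000"      # dataset row limit (illustrative value)
    os.environ["DEFAULT_BATCH_SIZE"] = "5"   # rows generated per batch (illustrative value)

    from synthetic_dataset_generator import launch

    launch()  # starts the local web UI for building and pushing datasets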

Highlighted Details

  • Supports multiple LLM providers and custom endpoints (see the endpoint sketch after this list).
  • Integrates with Argilla for data curation and Hugging Face Hub for dataset sharing.
  • Customizable generation parameters and prompt templates.
  • Docker setup available for Ollama and Argilla integration.
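
As a rough illustration of the custom-endpoint support, an OpenAI-compatible provider (such as a local Ollama or vLLM server) might be configured through the provider variables from the Quick Start list; the URL, key, and model name below are placeholders:

    # Sketch: point the generator at an OpenAI-compatible endpoint.
    # Variable names come from the Quick Start list; values are placeholders.
    import os

    os.environ["OPENAI_BASE_URL"] = "http://localhost:11434/v1"  # e.g. Ollama's OpenAI-compatible API
    os.environ["API_KEY"] = "not-needed-for-local"               # provider API key
    os.environ["MODEL"] = "llama3.1"                             # model served by the endpoint

    from synthetic_dataset_generator import launch

    launch()  # note: SFT/chat data generation is not supported with OpenAI endpoints (see Limitations)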

Maintenance & Community

  • The repository is pinned, suggesting active maintenance.
  • Links to examples and a development setup guide are provided.

Licensing & Compatibility

  • License: Apache-2.0
  • Compatible with commercial use.

Limitations & Caveats

SFT and chat data generation are not supported with OpenAI endpoints. Models that are not supported out of the box require specific prompt template configuration.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Tri Dao (Chief Scientist at Together AI), and 1 more.

hnet by goombalab

Top 0.3% on SourcePulse
800 stars
Hierarchical sequence modeling with dynamic chunking
Created 6 months ago
Updated 1 month ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 2 more.

Curator by NVIDIA-NeMo

Top 1.0% on SourcePulse
1k stars
Data curation toolkit for LLMs
Created 1 year ago
Updated 22 hours ago