awesome-synthetic-datasets  by davanstrien

Curated list of synthetic text/vision datasets and generation tools

created 1 year ago
290 stars

Top 91.7% on sourcepulse

Project Summary

This repository curates resources for generating synthetic text and vision datasets, primarily with Large Language Models (LLMs). It targets AI engineers and researchers who need practical methods and tools for creating artificial data to train and fine-tune models, with the goal of reducing time, cost, and carbon footprint.

How It Works

The project organizes tutorials, guides, and code examples demonstrating techniques like Self-Instruct and EvolInstruct for synthetic data generation. It highlights the use of LLMs to create diverse datasets, from short stories for small models (TinyStories) to large-scale instruction and chat samples (OpenHermes-2.5) and even website code with screenshots (WebSight). The approach leverages LLMs' generative capabilities to mimic real-world data patterns efficiently.
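As an illustration of the Self-Instruct idea mentioned above, the following is a minimal sketch, not code from the repository. It assumes a hypothetical `generate` callable that wraps whatever LLM is available, and it uses a simple word-overlap check as a stand-in for the similarity filter a real pipeline would apply. The names `build_prompt`, `word_overlap`, and `self_instruct` are made up for this example.

```python
import random
from typing import Callable, List


def build_prompt(examples: List[str]) -> str:
    """Format in-context example tasks into a prompt asking for one new task."""
    lines = ["Come up with a new task instruction.", ""]
    for i, example in enumerate(examples, start=1):
        lines.append(f"Task {i}: {example}")
    lines.append(f"Task {len(examples) + 1}:")
    return "\n".join(lines)


def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets (a simplified similarity measure)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def self_instruct(
    seed_tasks: List[str],
    generate: Callable[[str], str],  # hypothetical LLM wrapper: prompt -> text
    rounds: int = 10,
    k: int = 3,
    max_overlap: float = 0.7,
    seed: int = 0,
) -> List[str]:
    """Grow a task pool by prompting with sampled tasks and keeping novel outputs."""
    rng = random.Random(seed)
    pool = list(seed_tasks)
    for _ in range(rounds):
        examples = rng.sample(pool, min(k, len(pool)))
        candidate = generate(build_prompt(examples)).strip()
        if not candidate:
            continue  # the model returned nothing usable
        # Keep the candidate only if it is not too similar to any task in the pool.
        if all(word_overlap(candidate, task) <= max_overlap for task in pool):
            pool.append(candidate)
    return pool


if __name__ == "__main__":
    # Canned responses stand in for a real model call.
    canned = iter([
        "Translate a sentence into French.",
        "Summarize a paragraph in one sentence.",  # duplicate of a seed, filtered out
        "List three uses for a paperclip.",
    ])
    seeds = ["Summarize a paragraph in one sentence.", "Write a haiku about autumn."]
    for task in self_instruct(seeds, lambda prompt: next(canned, ""), rounds=3):
        print(task)
```

In practice `generate` would call a hosted or local model, and the filter would likely be stricter; the loop structure (sample examples, prompt, filter, add to pool) is the part this sketch is meant to show.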

Quick Start & Requirements

  • Installation: Primarily through Python packages (e.g., pip install distilabel, pip install datadreamer).
  • Prerequisites: Python, LLM providers (e.g., Hugging Face Hub, OpenAI API keys), potentially GPU for local generation.
  • Resources: Links to official documentation, blog posts, and Hugging Face collections are provided for specific tools and datasets.

Highlighted Details

  • Curated list of significant synthetic datasets like TinyStories, OpenHermes-2.5, and Cosmopedia (25B tokens).
  • Features libraries like distilabel for flexible synthetic data generation and DataDreamer for efficient workflows.
  • Includes code examples for techniques such as Self-Instruct, EvolInstruct, and Self-Contrast for LLM alignment.
  • References key research papers that introduced foundational techniques in synthetic data generation.
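The EvolInstruct technique listed above (Evol-Instruct in the WizardLM paper) repeatedly asks an LLM to rewrite an instruction into a harder variant. The sketch below shows the general shape only, under stated assumptions: `generate` is a hypothetical callable wrapping any LLM, the templates are loose paraphrases rather than the paper's prompts, and the validity check is deliberately minimal.

```python
import random
from typing import Callable, List, Tuple

# Paraphrased rewrite prompts (assumptions, not the published prompt text).
EVOLVE_TEMPLATES = {
    "add_constraint": (
        "Rewrite the task below so that it includes one additional constraint.\n\n"
        "Instruction: {instruction}\nRewritten instruction:"
    ),
    "deepen": (
        "Rewrite the task below so that it asks for more depth and detail.\n\n"
        "Instruction: {instruction}\nRewritten instruction:"
    ),
    "concretize": (
        "Rewrite the task below, replacing general concepts with specific ones.\n\n"
        "Instruction: {instruction}\nRewritten instruction:"
    ),
    "more_reasoning": (
        "Rewrite the task below so that solving it needs several reasoning steps.\n\n"
        "Instruction: {instruction}\nRewritten instruction:"
    ),
}


def evolve(
    instruction: str, generate: Callable[[str], str], rng: random.Random
) -> Tuple[str, str]:
    """Apply one randomly chosen rewrite operation; return (operation, output)."""
    operation = rng.choice(sorted(EVOLVE_TEMPLATES))
    prompt = EVOLVE_TEMPLATES[operation].format(instruction=instruction)
    return operation, generate(prompt).strip()


def is_valid(original: str, evolved: str) -> bool:
    """Reject empty outputs and outputs that merely repeat the original."""
    return bool(evolved) and evolved.lower() != original.lower()


def evolve_chain(
    seed_instruction: str,
    generate: Callable[[str], str],  # hypothetical LLM wrapper: prompt -> text
    depth: int = 3,
    seed: int = 0,
) -> List[str]:
    """Evolve an instruction up to `depth` times, keeping each valid step."""
    rng = random.Random(seed)
    chain = [seed_instruction]
    for _ in range(depth):
        _, candidate = evolve(chain[-1], generate, rng)
        if is_valid(chain[-1], candidate):
            chain.append(candidate)
    return chain
```

Each element of the returned chain can serve as a training instruction, which is how one seed yields several samples of increasing difficulty. A production pipeline would add stronger filtering of failed rewrites.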

Maintenance & Community

  • Actively developed libraries like distilabel are highlighted.
  • Links to Hugging Face collections and research papers indicate community engagement with the topic.

Licensing & Compatibility

  • Licenses vary by dataset and tool; synthetic_text_to_sql is Apache 2.0.
  • Compatibility for commercial use depends on the specific dataset or tool's license.

Limitations & Caveats

The repository is a curated list and not a single executable tool; users must integrate individual libraries and datasets. The effectiveness of generated data depends heavily on the chosen LLM and generation methodology.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 0

Star History

  • 13 stars in the last 90 days
