awesome-synthetic-datasets by davanstrien

Curated list of synthetic text/vision datasets and generation tools

Created 1 year ago

320 stars

Top 84.9% on SourcePulse

1 Expert Loves This Project

Wing Lian

Founder of Axolotl AI

Project Summary

This repository curates resources for generating synthetic text and vision datasets, primarily using Large Language Models (LLMs). It targets AI engineers and researchers seeking practical methods and tools to create artificial data for training and fine-tuning models, with the aim of reducing time, cost, and carbon footprint.

How It Works

The project organizes tutorials, guides, and code examples demonstrating techniques like Self-Instruct and EvolInstruct for synthetic data generation. It highlights the use of LLMs to create diverse datasets, from short stories for small models (TinyStories) to large-scale instruction and chat samples (OpenHermes-2.5) and even website code with screenshots (WebSight). The approach leverages LLMs' generative capabilities to mimic real-world data patterns efficiently.
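
To make the Self-Instruct idea above concrete, here is a minimal sketch of the bootstrap loop: sample a few instructions from the pool, prompt a model for a new one, and keep it only if it is not a near-duplicate. Several parts are assumptions made for illustration: `make_fake_llm` is a hypothetical stand-in for a real model call, the prompt wording is invented, and token-level Jaccard similarity is swapped in as a simple substitute for a proper overlap metric. None of this is taken from the repository's own examples.

```python
import itertools
import random
from typing import Callable, List


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (a simple stand-in for an overlap metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def build_prompt(examples: List[str]) -> str:
    """Format sampled instructions as a numbered few-shot prompt (wording is illustrative)."""
    lines = ["Come up with a new task instruction."]
    for i, example in enumerate(examples, start=1):
        lines.append(f"Task {i}: {example}")
    lines.append(f"Task {len(examples) + 1}:")
    return "\n".join(lines)


def self_instruct(
    seeds: List[str],
    generate: Callable[[str], str],
    rounds: int = 10,
    shots: int = 3,
    max_similarity: float = 0.7,
    seed: int = 0,
) -> List[str]:
    """Grow an instruction pool from seed tasks and return only the new instructions."""
    rng = random.Random(seed)
    pool = list(seeds)
    added: List[str] = []
    for _ in range(rounds):
        examples = rng.sample(pool, k=min(shots, len(pool)))
        candidate = generate(build_prompt(examples)).strip()
        if not candidate:
            continue
        # Keep the candidate only if it is not too close to anything already in the pool.
        if all(jaccard(candidate, existing) < max_similarity for existing in pool):
            pool.append(candidate)
            added.append(candidate)
    return added


def make_fake_llm() -> Callable[[str], str]:
    """Hypothetical model stub: ignores the prompt and cycles through canned outputs."""
    canned = itertools.cycle([
        "Translate the sentence into French.",
        "Summarize the paragraph in one sentence.",
        "Write a haiku about autumn rain.",
    ])
    return lambda prompt: next(canned)


if __name__ == "__main__":
    seed_tasks = [
        "Summarize the paragraph in one sentence.",
        "List three synonyms for the given word.",
    ]
    for instruction in self_instruct(seed_tasks, make_fake_llm(), rounds=6):
        print(instruction)
```

In practice the `generate` callable would wrap a real LLM client, and the similarity threshold would need tuning for the task at hand.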

Quick Start & Requirements

Installation: Primarily through Python packages (e.g., pip install distilabel, pip install datadreamer).
Prerequisites: Python, access to LLM providers (e.g., Hugging Face Hub or OpenAI, with API keys), and potentially a GPU for local generation.
Resources: Links to official documentation, blog posts, and Hugging Face collections are provided for specific tools and datasets.
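
As a rough illustration of checking the prerequisites above, the following sketch reports whether the named packages can be imported and whether API keys are set. It assumes that the import names match the pip package names given above and that keys are supplied through the environment variables `OPENAI_API_KEY` and `HF_TOKEN`; both are assumptions for illustration, and `check_environment` is a hypothetical helper, not part of any listed tool.

```python
import importlib.util
import os
from typing import Dict, Iterable, Mapping


def check_environment(
    packages: Iterable[str],
    env_vars: Iterable[str],
    environ: Mapping[str, str] = os.environ,
) -> Dict[str, bool]:
    """Report which top-level packages are importable and which env vars are non-empty."""
    report: Dict[str, bool] = {}
    for name in packages:
        # find_spec returns None when a top-level module cannot be located.
        report[f"package:{name}"] = importlib.util.find_spec(name) is not None
    for var in env_vars:
        report[f"env:{var}"] = bool(environ.get(var))
    return report


if __name__ == "__main__":
    # Package and variable names below are assumptions, not confirmed requirements.
    status = check_environment(
        ["distilabel", "datadreamer"],
        ["OPENAI_API_KEY", "HF_TOKEN"],
    )
    for item, ok in status.items():
        print(f"{item}: {'ok' if ok else 'missing'}")
```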

Highlighted Details

Curated list of significant synthetic datasets like TinyStories, OpenHermes-2.5, and Cosmopedia (25B tokens).
Features libraries like distilabel for flexible synthetic data generation and DataDreamer for efficient workflows.
Includes code examples for techniques such as Self-Instruct, EvolInstruct, and Self-Contrast for LLM alignment.
References key research papers that introduced foundational techniques in synthetic data generation.
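
The EvolInstruct technique named in the list above can be sketched as repeated rewriting of an instruction into a harder variant, discarding rewrites that fail. This is a simplified illustration under stated assumptions: the three rewrite templates are paraphrased for this sketch rather than taken from any paper or library, and `rewrite` is a hypothetical callable standing in for an LLM.

```python
import random
from typing import Callable, List

# Illustrative rewrite operations; the wording is an assumption made for this sketch.
EVOLVE_TEMPLATES = {
    "add_constraint": "Rewrite the instruction below so it has one extra constraint.\n\n{instruction}",
    "concretize": "Rewrite the instruction below, replacing general concepts with specific ones.\n\n{instruction}",
    "deepen": "Rewrite the instruction below so it asks for more depth on the same topic.\n\n{instruction}",
}


def evolve(
    instruction: str,
    rewrite: Callable[[str], str],
    steps: int = 3,
    seed: int = 0,
) -> List[str]:
    """Return the chain of instructions, starting with the original."""
    rng = random.Random(seed)
    chain = [instruction]
    for _ in range(steps):
        operation = rng.choice(sorted(EVOLVE_TEMPLATES))
        prompt = EVOLVE_TEMPLATES[operation].format(instruction=chain[-1])
        candidate = rewrite(prompt).strip()
        # Elimination step: skip empty rewrites and rewrites that changed nothing.
        if not candidate or candidate == chain[-1]:
            continue
        chain.append(candidate)
    return chain


def fake_rewrite(prompt: str) -> str:
    """Hypothetical model stub: echoes the instruction with a marker appended."""
    instruction = prompt.split("\n\n", 1)[1]
    return instruction + " (harder)"


if __name__ == "__main__":
    for step, text in enumerate(evolve("Describe a tree.", fake_rewrite, steps=2)):
        print(step, text)
```

A real pipeline would also filter evolved instructions for quality, for example by checking that the model can still answer them.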

Maintenance & Community

Actively developed libraries like distilabel are highlighted.
Links to Hugging Face collections and research papers indicate community engagement with the topic.

Licensing & Compatibility

Licenses vary by dataset and tool; for example, synthetic_text_to_sql is licensed under Apache 2.0.
Whether commercial use is permitted depends on the license of the specific dataset or tool.

Limitations & Caveats

The repository is a curated list and not a single executable tool; users must integrate individual libraries and datasets. The effectiveness of generated data depends heavily on the chosen LLM and generation methodology.

Health Check

Last Commit

3 days ago

Responsiveness

Inactive

Star History

6 stars in the last 30 days