cosmopedia by huggingface

Synthetic dataset for LLM training

Created 1 year ago

557 stars

Top 57.4% on SourcePulse

4 Experts Love This Project

Shizhe Diao

Author of LMFlow; Research Scientist at NVIDIA

Maxime Labonne

Head of Post-Training at Liquid AI

Omar Sanseviero

DevRel at Google DeepMind

Luca Soldaini

Research Scientist at Ai2

Project Summary

Cosmopedia is a large-scale, open-source dataset of synthetic text intended to support research on large language models (LLMs), particularly synthetic data generation. It contains over 30 million files and 25 billion tokens, and aims to cover a broad range of world knowledge by mapping the topics found in existing web datasets.

How It Works

The dataset was generated with Mixtral-8x7B-Instruct-v0.1 and covers diverse content types, including textbooks, blog posts, stories, and WikiHow articles. The pipeline has four stages: building prompts, large-scale generation with llm-swarm (over 10k H100 GPU hours), MinHash deduplication with datatrove, and n-gram decontamination against evaluation benchmarks.
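To make the deduplication stage more concrete, here is a minimal, standard-library sketch of the MinHash idea: each document is reduced to a short signature, and the fraction of matching signature slots estimates the Jaccard similarity of the documents' word n-gram sets. This is not datatrove's API or the project's actual configuration. The function names, the 3-word shingles, the 64 hash functions, and the 0.7 threshold are all assumptions chosen for illustration, and a production pipeline would typically use LSH banding rather than comparing every pair.

```python
import hashlib
from itertools import combinations


def shingles(text: str, n: int = 3) -> set[str]:
    """Return the set of word n-grams (shingles) in `text`."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(gram: str, seed: int) -> int:
    """Seeded 64-bit hash; the seed is passed as the BLAKE2b salt."""
    digest = hashlib.blake2b(
        gram.encode("utf-8"), digest_size=8, salt=seed.to_bytes(4, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str, num_perm: int = 64, n: int = 3) -> list[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    grams = shingles(text, n)
    return [min((_hash(g, seed) for g in grams), default=0) for seed in range(num_perm)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature slots that agree, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def near_duplicates(docs: list[str], threshold: float = 0.7) -> list[tuple[int, int]]:
    """Return index pairs whose estimated similarity meets the (illustrative) threshold."""
    sigs = [minhash_signature(d) for d in docs]
    return [
        (i, j)
        for i, j in combinations(range(len(docs)), 2)
        if estimated_jaccard(sigs[i], sigs[j]) >= threshold
    ]
```

The pairwise comparison is quadratic in the number of documents, so it only suits small samples; it is meant to show what the signatures measure, not how to deduplicate 30 million files.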

Quick Start & Requirements

Dataset Access: The dataset is hosted on the Hugging Face Hub as 🤗 Cosmopedia (HuggingFaceTB/cosmopedia).
Prerequisites: Reproducing the generation pipeline requires significant compute (the original run used over 10k H100 GPU hours). Working with the full dataset will likely also require substantial storage and processing capacity.
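Because the full dataset is large, one way to explore it without downloading everything is to stream a single subset with the `datasets` library. The sketch below assumes the Hub repository ID `HuggingFaceTB/cosmopedia`, a subset named `stories`, and a `text` field on each record; check the dataset card before relying on any of these. `stream_subset` and `preview` are hypothetical helper names introduced here.

```python
from itertools import islice
from typing import Iterable


def stream_subset(subset: str = "stories", split: str = "train"):
    """Lazily stream one subset instead of downloading the whole dataset.

    Assumes the repo ID and subset name given in the lead-in; needs network access.
    """
    from datasets import load_dataset  # third-party: pip install datasets

    return load_dataset("HuggingFaceTB/cosmopedia", subset, split=split, streaming=True)


def preview(records: Iterable[dict], n: int = 3, field: str = "text", width: int = 80) -> list[str]:
    """Return the first `width` characters of `field` for the first `n` records."""
    return [str(r.get(field, ""))[:width] for r in islice(records, n)]


if __name__ == "__main__":
    # Only the first few records are fetched, not the full subset.
    for snippet in preview(stream_subset(), n=3):
        print(snippet)
```

Streaming keeps local storage requirements low during exploration; training on the full 25B tokens would still need the storage and processing capacity noted above.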

Highlighted Details

Largest open synthetic dataset at the time of release (25B tokens).
Generated using Mixtral-8x7B-Instruct-v0.1.
Includes code for prompt building, generation, deduplication, and decontamination.
Mapped world knowledge from datasets like RefinedWeb and RedPajama.
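As an illustration of the decontamination step listed above, the following sketch drops any generated sample that shares a word n-gram with a benchmark text. It is a simplified stand-in, not the repository's code: the function names are hypothetical, and the default of 10-word n-grams is an assumption made for the example.

```python
import re


def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lowercase, split on word characters, and return all word n-grams."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_benchmark_index(benchmark_texts: list[str], n: int = 10) -> set[tuple[str, ...]]:
    """Collect every n-gram that appears in any benchmark text."""
    index: set[tuple[str, ...]] = set()
    for text in benchmark_texts:
        index |= word_ngrams(text, n)
    return index


def is_contaminated(sample: str, index: set[tuple[str, ...]], n: int = 10) -> bool:
    """A sample is flagged if it shares at least one n-gram with the benchmarks."""
    return not word_ngrams(sample, n).isdisjoint(index)


def decontaminate(samples: list[str], benchmark_texts: list[str], n: int = 10) -> list[str]:
    """Keep only the samples with no n-gram overlap against the benchmarks."""
    index = build_benchmark_index(benchmark_texts, n)
    return [s for s in samples if not is_contaminated(s, index, n)]
```

A single shared n-gram is a strict criterion that can flag common phrases, so a real pipeline might treat a hit as a candidate and apply a second, finer comparison before discarding the sample.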

Maintenance & Community

The project is associated with Hugging Face.
The README links to the dataset, a 1B-parameter LLM trained on Cosmopedia, and a blog post.

Licensing & Compatibility

License details are not explicitly stated in the provided README snippet. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

This is v0.1 of Cosmopedia, so there is room for improvement, including more comprehensive topic coverage. The compute required to reproduce generation and processing may be a barrier for some users.

Health Check

Last Commit: 1 year ago
Responsiveness: Inactive

Star History

2 stars in the last 30 days