cosmopedia by huggingface

Synthetic dataset for LLM training

created 1 year ago
529 stars

Top 60.6% on sourcepulse

Project Summary

Cosmopedia is a large-scale, open-source dataset of synthetic text designed to aid research in large language models (LLMs), particularly in the domain of synthetic data generation. It offers over 30 million files and 25 billion tokens, aiming to cover a broad spectrum of world knowledge mapped from existing web datasets.

How It Works

The dataset is generated with Mixtral-8x7B-Instruct-v0.1 and spans diverse content types, including textbooks, blog posts, stories, and WikiHow articles. The pipeline has four stages: prompt building, large-scale generation with llm-swarm (over 10k H100 GPU hours), MinHash deduplication via datatrove, and n-gram decontamination against evaluation benchmarks.
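The deduplication and decontamination stages can be illustrated with a minimal, standard-library sketch. This is not datatrove's API or the project's actual code: the function names, the whitespace tokenization, the 10-word n-gram window, the 3-word shingles, and the 64 hash functions are all assumptions chosen for illustration.

```python
import hashlib


def word_ngrams(text, n):
    """Return the set of lowercase, whitespace-tokenized word n-grams."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_contaminated(sample, benchmark_texts, n=10):
    """Flag a sample that shares at least one word n-gram with a benchmark text.

    The window size n=10 is an illustrative assumption, not the project's setting.
    """
    sample_grams = word_ngrams(sample, n)
    return any(sample_grams & word_ngrams(ref, n) for ref in benchmark_texts)


def minhash_signature(text, num_hashes=64, n=3):
    """Build a MinHash signature over word shingles.

    For each seeded hash function, keep the minimum hash value seen across
    all shingles. Texts shorter than n words fall back to a single shingle.
    """
    shingles = word_ngrams(text, n) or {text.lower()}
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{s}".encode("utf-8")).digest()[:8], "big"
            )
            for s in shingles
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

In a pipeline of this shape, samples flagged by `is_contaminated` would be dropped, and pairs whose `estimated_jaccard` exceeds a chosen threshold would be treated as near-duplicates. At the scale of tens of millions of files, a production system would also bucket signatures (for example with locality-sensitive hashing) rather than compare every pair.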

Quick Start & Requirements

  • Dataset Access: The dataset is available via the 🤗 Cosmopedia dataset link.
  • Prerequisites: Reproducing the generation pipeline requires significant compute (the original run used over 10k H100 GPU hours). Downloading and processing the full dataset will likely require substantial storage and processing capacity.
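Because the full dataset is large, streaming a few rows is a cheap way to inspect it before committing storage. The sketch below uses the Hugging Face `datasets` library. The repository id `HuggingFaceTB/cosmopedia` and the `stories` subset name are assumptions not stated in this summary, so check the dataset page for the actual identifiers. The helper name `stream_samples` is hypothetical.

```python
from itertools import islice


def stream_samples(repo_id="HuggingFaceTB/cosmopedia", subset="stories", limit=3):
    """Yield the first `limit` rows without downloading the whole dataset.

    `repo_id` and `subset` are assumed values; replace them with the
    identifiers shown on the dataset page.
    """
    # Imported lazily so this module still loads where `datasets` is not installed.
    from datasets import load_dataset

    # streaming=True returns an iterable that fetches rows on demand.
    rows = load_dataset(repo_id, subset, split="train", streaming=True)
    yield from islice(rows, limit)


# Example usage (requires `pip install datasets` and network access):
#   for row in stream_samples(limit=2):
#       print(sorted(row.keys()))
```

Streaming avoids the storage cost noted above, at the price of slower random access. For repeated full passes over the data, a local download is usually the better choice.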

Highlighted Details

  • Described as the largest open synthetic dataset at the time of release (25B tokens).
  • Generated using Mixtral-8x7B-Instruct-v0.1.
  • Includes code for prompt building, generation, deduplication, and decontamination.
  • Mapped world knowledge from datasets like RefinedWeb and RedPajama.

Maintenance & Community

  • The project is associated with Hugging Face.
  • The README links to the dataset, a 1B LLM trained on Cosmopedia, and a blog post.

Licensing & Compatibility

  • License details are not explicitly stated in the provided README snippet. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

This is v0.1 of Cosmopedia, so there is room for improvement, including more comprehensive topic coverage. The extensive resource requirements for generation and processing may be a barrier for some users.

Health Check

  • Last commit: 8 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 14 stars in the last 90 days
