cosmopedia by huggingface

Synthetic dataset for LLM training

Created 1 year ago
540 stars

Top 58.9% on SourcePulse

Project Summary

Cosmopedia is a large-scale, open-source dataset of synthetic text built to support research on large language models (LLMs), particularly on synthetic data generation. It contains over 30 million files and 25 billion tokens, and aims to cover a broad spectrum of world knowledge by mapping the topics found in existing web datasets.

How It Works

The dataset is generated with Mixtral-8x7B-Instruct-v0.1 and covers diverse content types, including textbooks, blog posts, stories, and WikiHow articles. The pipeline has four stages: building prompts, large-scale generation with llm-swarm (over 10k H100 GPU hours), MinHash deduplication with datatrove, and n-gram decontamination against evaluation benchmarks.
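To make the deduplication stage concrete, here is a minimal sketch of MinHash near-duplicate detection in plain Python. It is not the datatrove implementation used by the project: the function names, the 3-word shingles, the 64 hash functions, and the 0.7 threshold are all illustrative assumptions, and the pairwise comparison is quadratic where a production pipeline would use bucketed locality-sensitive hashing.

```python
import hashlib
from typing import List, Set


def shingles(text: str, n: int = 3) -> Set[str]:
    """Return the set of word n-grams (shingles) of a text."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _seeded_hash(seed: int, token: str) -> int:
    """64-bit hash of a token; the seed selects one hash function of the family."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str, num_hashes: int = 64, n: int = 3) -> List[int]:
    """For each hash function, keep the minimum hash over all shingles."""
    grams = shingles(text, n)
    return [min(_seeded_hash(seed, g) for g in grams) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs: List[str], threshold: float = 0.7) -> List[str]:
    """Greedy dedup: keep a document only if it is not too similar to a kept one."""
    kept: List[str] = []
    kept_sigs: List[List[int]] = []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept
```

The threshold trades recall for precision: a lower value removes more near-duplicates but also risks dropping documents that merely share boilerplate phrasing.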

Quick Start & Requirements

  • Dataset Access: The dataset is hosted on the 🤗 Hugging Face Hub.
  • Prerequisites: Regenerating the dataset requires significant compute (the original generation used over 10k H100 GPU hours). Downloading and processing the full dataset will likely require substantial storage and processing capacity.
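
Because the full dataset is large, a first look is probably easiest in streaming mode, which avoids downloading everything up front. The sketch below assumes the `datasets` library is installed and that the dataset lives at `HuggingFaceTB/cosmopedia` with a subset named `stories`; neither identifier appears in this summary, so check both against the dataset page. The helper names are hypothetical.

```python
from itertools import islice
from typing import Any, Iterable, List

# Assumption: Hub repository ID, not stated in the summary above.
DATASET_ID = "HuggingFaceTB/cosmopedia"


def take_first(records: Iterable[Any], n: int) -> List[Any]:
    """Return at most n items from any iterable without consuming the rest."""
    return list(islice(records, n))


def preview_cosmopedia(subset: str = "stories", n: int = 3) -> List[Any]:
    """Stream the first n records of one subset (subset name is an assumption)."""
    # Imported lazily so the helpers above work without the library installed.
    from datasets import load_dataset  # pip install datasets

    stream = load_dataset(DATASET_ID, subset, split="train", streaming=True)
    return take_first(stream, n)
```

Calling `preview_cosmopedia("stories", 3)` needs network access and should return a list of three record dictionaries, whose field names are best read from the records themselves.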

Highlighted Details

  • Described as the largest open synthetic dataset at the time of its release (25B tokens).
  • Generated using Mixtral-8x7B-Instruct-v0.1.
  • Includes code for prompt building, generation, deduplication, and decontamination.
  • Maps world knowledge from web datasets such as RefinedWeb and RedPajama.
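
The decontamination step mentioned in the list above can be illustrated with a small n-gram overlap filter. This is a sketch of the general technique, not the repository's code: the function names are hypothetical, the tokenization is a simple regex, and the default `n=10` is an illustrative choice, since the summary does not state the n-gram size.

```python
import re
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Lowercase the text, split it into word tokens, and return all n-grams."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_texts: Iterable[str], n: int = 10) -> Set[NGram]:
    """Collect every n-gram that occurs in any benchmark sample."""
    index: Set[NGram] = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index


def is_contaminated(sample: str, index: Set[NGram], n: int = 10) -> bool:
    """A sample is flagged if it shares at least one n-gram with the benchmarks."""
    return not ngrams(sample, n).isdisjoint(index)


def decontaminate(samples: Iterable[str], benchmark_texts: Iterable[str],
                  n: int = 10) -> List[str]:
    """Return only the samples that share no n-gram with the benchmarks."""
    index = build_benchmark_index(benchmark_texts, n)
    return [s for s in samples if not is_contaminated(s, index, n)]
```

A smaller `n` flags more samples, including false positives from common phrases; a larger `n` only catches near-verbatim copies of benchmark text.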

Maintenance & Community

  • The project is associated with Hugging Face.
  • The project links to the dataset, a 1B LLM trained on Cosmopedia, and a blog post.

Licensing & Compatibility

  • License details are not explicitly stated in the provided README snippet. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

This is v0.1 of Cosmopedia, indicating potential for improvements and more comprehensive topic coverage. The extensive resource requirements for generation and processing may be a barrier for some users.

Health Check

  • Last commit: 10 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 5 stars in the last 30 days
