Synthetic dataset for LLM training
Cosmopedia is a large-scale, open-source dataset of synthetic text designed to aid research in large language models (LLMs), particularly in the domain of synthetic data generation. It offers over 30 million files and 25 billion tokens, aiming to cover a broad spectrum of world knowledge by mapping topics from existing web datasets.
How It Works
The dataset is generated with Mixtral-8x7B-Instruct-v0.1, with a focus on creating diverse content types including textbooks, blog posts, stories, and WikiHow-style articles. The pipeline builds prompts, runs large-scale generation with llm-swarm (consuming over 10k hours of H100 GPU time), performs MinHash deduplication via datatrove, and applies n-gram decontamination against evaluation benchmarks.
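The deduplication and decontamination steps can be sketched with a toy MinHash filter and an n-gram overlap check. This is a minimal illustration of the two techniques, not datatrove's actual implementation; all function names, thresholds, and parameters here are invented for the example.

```python
import hashlib


def minhash_signature(text, num_perm=64, shingle_n=3):
    """MinHash signature over word shingles: one min-hash per seeded hash function."""
    words = text.lower().split()
    shingles = {" ".join(words[i:i + shingle_n])
                for i in range(max(1, len(words) - shingle_n + 1))}
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")  # distinct salt simulates a hash family
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "big")
            for s in shingles))
    return sig


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=0.8, num_perm=64):
    """Keep a document only if it is not a near-duplicate of one already kept."""
    kept, sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc, num_perm=num_perm)
        if all(estimated_jaccard(sig, s) < threshold for s in sigs):
            kept.append(doc)
            sigs.append(sig)
    return kept


def contaminated(text, benchmark_texts, n=10):
    """Flag text sharing any word n-gram with an evaluation benchmark."""
    def ngrams(t):
        w = t.lower().split()
        return {tuple(w[i:i + n]) for i in range(len(w) - n + 1)}
    bench = set()
    for b in benchmark_texts:
        bench |= ngrams(b)
    return bool(ngrams(text) & bench)
```

At Cosmopedia's scale the same ideas are applied with banded locality-sensitive hashing rather than the quadratic pairwise comparison shown here, so each new document is checked only against candidate buckets instead of every kept document.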
Limitations & Caveats
This is v0.1 of Cosmopedia, so there is room for improved prompting and more comprehensive topic coverage in future releases. The substantial compute required to regenerate or process the dataset may also be a barrier for some users.