Math corpus for pretraining language models
MathPile is a math-centric pretraining corpus of roughly 9.5 billion tokens, designed to improve the mathematical reasoning of large language models. It targets researchers and developers building foundation models for scientific and mathematical domains, offering a diverse, high-quality dataset that addresses the limitations of both general-purpose corpora and narrowly focused math corpora.
How It Works
MathPile aggregates data from a wide range of sources, including textbooks, arXiv, Wikipedia, ProofWiki, StackExchange, and web pages, covering educational levels from K-12 to postgraduate as well as math competitions. The corpus prioritizes data quality over quantity: documents go through preprocessing, prefiltering, cleaning, and deduplication before inclusion. This focus on curated, math-specific content is what distinguishes it from broader pretraining corpora.
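The prefiltering and deduplication stages mentioned above can be illustrated with a minimal Python sketch. It is not the MathPile pipeline: it swaps in exact-match deduplication on whitespace-normalized text, and the function names and the `min_chars` threshold are hypothetical choices made for illustration only.

```python
import hashlib
import re
from typing import Iterable, Iterator


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return re.sub(r"\s+", " ", text).strip().lower()


def prefilter(text: str, min_chars: int = 20) -> bool:
    """Hypothetical quality gate: keep documents with enough normalized content."""
    return len(normalize(text)) >= min_chars


def deduplicate(docs: Iterable[str], min_chars: int = 20) -> Iterator[str]:
    """Yield the first occurrence of each document that passes the prefilter."""
    seen = set()
    for doc in docs:
        if not prefilter(doc, min_chars):
            continue  # too short to be a useful document
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        yield doc


# Toy input (made up for this example): one duplicate and one too-short entry.
docs = [
    "Theorem. Every bounded monotone sequence converges.",
    "Theorem.  Every bounded monotone   sequence converges.\n",
    "ok",
    "Lemma. The sum of two even integers is even.",
]
kept = list(deduplicate(docs))
```

Exact hashing only catches copies that become identical after normalization. Catching near-duplicates at corpus scale would need a fuzzy method, which this sketch does not attempt.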
Quick Start & Requirements
The dataset is available on Hugging Face Datasets. Processing scripts are located in the src directory. No specific hardware requirements are mentioned beyond standard data processing capabilities.
Highlighted Details
Maintenance & Community
The project was accepted to the NeurIPS 2024 Datasets and Benchmarks (D&B) Track and has been featured on the Hugging Face Datasets trending list. The primary contributors are Zengzhi Wang, Xuefeng Li, Rui Xia, and Pengfei Liu.
Licensing & Compatibility
MathPile is licensed under CC BY-NC-SA 4.0, which prohibits commercial use and requires derivative works to be distributed under the same terms. Source data released under more restrictive licenses remains subject to those licenses. A separate commercial-use version is available.
Limitations & Caveats
The data collection and processing decisions may not be optimal, and some documents might not be of the highest quality. The creators are committed to ongoing refinement. Users are strongly urged to refrain from using the corpus for activities that harm national or social security or violate the law.