MathPile by GAIR-NLP

Math corpus for pretraining language models

Created 2 years ago

418 stars

Top 69.7% on SourcePulse

3 Experts Love This Project

Shizhe Diao

Author of LMFlow; Research Scientist at NVIDIA

Elvis Saravia

Founder of DAIR.AI

Binyuan Hui

Research Scientist at Alibaba Qwen

Project Summary

MathPile is a 9.5-billion-token, math-centric pretraining corpus designed to improve the mathematical reasoning of large language models. It targets researchers and developers building foundation models for scientific and mathematical domains, and it aims to address the limitations of both general-purpose corpora and narrowly focused math corpora by being diverse and high quality.

How It Works

MathPile aggregates data from textbooks, arXiv, Wikipedia, ProofWiki, StackExchange, and web pages, covering material from K-12 through postgraduate level as well as math competitions. The corpus prioritizes data quality over quantity: documents pass through preprocessing, prefiltering, cleaning, and deduplication. The result is a diverse dataset built specifically for mathematical text, in contrast to broader general-purpose corpora.
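To make the deduplication step concrete, here is a minimal sketch of exact-match deduplication on normalized text. The summary does not say which deduplication method MathPile uses, so this is an illustrative stand-in rather than the project's pipeline; the function names `normalize` and `dedup_exact` are hypothetical.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedup_exact(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, comparing normalized hashes.

    Illustrative only: this catches exact duplicates after normalization,
    not near-duplicates.
    """
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept
```

Hashing the normalized text instead of storing it keeps memory use roughly constant per document, which matters at corpus scale.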

Quick Start & Requirements

The dataset is available on Hugging Face Datasets. Processing scripts are located in the src directory of the repository. No hardware requirements are stated beyond standard data processing capabilities.
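A minimal sketch of streaming the corpus with the Hugging Face `datasets` library follows. The repository ID `GAIR/MathPile`, the `train` split name, and the `text` field are assumptions not stated in this summary; check the dataset card before relying on them. Access may also require accepting the dataset's terms and authenticating with the Hub.

```python
from itertools import islice
from typing import Iterable

# Assumption: Hub repository ID; confirm it on the dataset card.
REPO_ID = "GAIR/MathPile"


def stream_corpus(repo_id: str = REPO_ID, split: str = "train") -> Iterable[dict]:
    """Stream records lazily instead of downloading the whole corpus first."""
    # Third-party dependency (pip install datasets), imported lazily.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)


def preview(records: Iterable[dict], n: int = 3,
            field: str = "text", width: int = 80) -> list[str]:
    """Return the first `width` characters of `field` for the first `n` records.

    `field="text"` is an assumed column name; missing fields yield "".
    """
    return [str(r.get(field, ""))[:width] for r in islice(records, n)]
```

Calling `preview(stream_corpus())` would fetch only the first few records, which is a cheap way to inspect the schema before committing to a full download.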

Highlighted Details

9.5 billion tokens of math-centric text.
Diverse sources: textbooks (~0.19B tokens), arXiv, Wikipedia, ProofWiki, StackExchange, web pages.
Extensive data documentation, including quality annotations and contamination detection against benchmarks like MATH and MMLU-STEM.
A commercially usable version, MathPile_Commercial, is also available.
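The contamination detection mentioned above can be illustrated with a simple word n-gram overlap check between a corpus document and benchmark items. This summary does not describe the method MathPile actually uses, so the sketch below is an assumption-laden illustration; `ngrams`, `is_contaminated`, and the default `n=8` are hypothetical choices.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(document: str, benchmark_items: list[str], n: int = 8) -> bool:
    """Flag a document that shares at least one word n-gram with any benchmark item.

    Illustrative heuristic only; texts shorter than `n` words never match.
    """
    doc_grams = ngrams(document, n)
    return any(doc_grams & ngrams(item, n) for item in benchmark_items)
```

A larger `n` reduces false positives from common phrasing such as "find the value of", at the cost of missing lightly paraphrased benchmark problems.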

Maintenance & Community

The project was accepted to the NeurIPS 2024 Datasets and Benchmarks (D&B) Track and has been featured on the Hugging Face Datasets trending list. The primary contributors are Zengzhi Wang, Xuefeng Li, Rui Xia, and Pengfei Liu.

Licensing & Compatibility

MathPile is licensed under CC BY-NC-SA 4.0, which prohibits commercial use and requires derivative works to be distributed under the same terms. Source data released under more restrictive licenses remains subject to those licenses. A separate version for commercial use, MathPile_Commercial, is available.

Limitations & Caveats

The creators acknowledge that their data collection and processing decisions may not be optimal and that some documents may not be of the highest quality. They state a commitment to ongoing refinement. Users are strongly urged not to use the corpus for activities that harm national or social security or that violate the law.

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Star History

0 stars in the last 30 days