MathPile by GAIR-NLP

Math corpus for pretraining language models

created 1 year ago
415 stars

Top 71.7% on sourcepulse

Project Summary

MathPile is a math-centric pretraining corpus of about 9.5 billion tokens, built to improve the mathematical reasoning of large language models. It targets researchers and developers building foundation models for scientific and mathematical domains, and it aims to be more math-specific than general-purpose corpora while staying more diverse than narrowly focused math corpora.

How It Works

MathPile aggregates data from textbooks, arXiv, Wikipedia, ProofWiki, StackExchange, and web pages, covering material from K-12 through postgraduate level as well as math competitions. The corpus prioritizes quality over quantity: source documents pass through preprocessing, prefiltering, cleaning, and deduplication before inclusion, which is what distinguishes it from broader corpora.
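To make the filtering and deduplication steps concrete, here is a minimal sketch of a length prefilter followed by exact deduplication on normalized text. It is not the MathPile pipeline: the function names, the `min_chars` threshold, and the normalization rule are all assumptions chosen for illustration, and the actual scripts in the repository may use different heuristics.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_and_dedup(docs, min_chars=20):
    """Drop very short documents, then drop exact duplicates after normalization.

    `min_chars` is a hypothetical threshold, not a value taken from MathPile.
    """
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        if len(norm) < min_chars:  # prefilter: too short to be useful
            continue
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:  # exact duplicate of an earlier document
            continue
        seen.add(digest)
        kept.append(doc)  # keep the original, un-normalized text
    return kept


docs = [
    "Let $f(x) = x^2$. Then $f'(x) = 2x$ by the power rule.",
    "let $f(x) = x^2$.   Then $f'(x) = 2x$ by the power rule.",
    "ok",
    "A group is a set with an associative operation, an identity, and inverses.",
]
result = clean_and_dedup(docs)
```

Hashing the normalized text keeps memory use proportional to the number of unique documents rather than their total size. Near-duplicate detection (for example MinHash) would be a separate step and is not shown here.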

Quick Start & Requirements

The dataset is available on Hugging Face Datasets, and the processing scripts are in the src directory of the repository. No hardware requirements are stated beyond standard data-processing capabilities.
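One possible way to peek at the corpus without downloading all of it is to stream a few records with the Hugging Face `datasets` library. The sketch below makes several assumptions: the repository id `GAIR/MathPile` and the `train` split name are guesses that should be checked against the dataset card, the dataset may require accepting access terms and logging in first, and `preview_corpus` is a hypothetical helper, not part of the project.

```python
from itertools import islice


def preview_corpus(n=3, repo_id="GAIR/MathPile", split="train", loader=None):
    """Return the first `n` records of a streamed dataset.

    `repo_id` and `split` are assumptions; check the dataset card.
    `loader` exists so the function can be exercised offline with a stub.
    """
    if loader is None:
        # Imported lazily so the module loads even without `datasets` installed.
        from datasets import load_dataset  # pip install datasets
        loader = load_dataset
    # streaming=True yields records lazily instead of downloading every shard.
    stream = loader(repo_id, split=split, streaming=True)
    return list(islice(stream, n))


# Example (needs network access and, possibly, an authenticated session):
# rows = preview_corpus(3)
# print(rows[0].keys())
```

Streaming is a reasonable default for a corpus of this size, since it avoids materializing the full dataset on disk just to inspect its schema.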

Highlighted Details

  • 9.5 billion tokens of math-centric text.
  • Diverse sources: textbooks (~0.19B tokens), arXiv, Wikipedia, ProofWiki, StackExchange, web pages.
  • Extensive data documentation, including quality annotations and contamination detection against benchmarks like MATH and MMLU-STEM.
  • A commercially usable version, MathPile_Commercial, is also available.
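The summary does not say how contamination against benchmarks such as MATH and MMLU-STEM is detected. As a generic illustration of the idea, the sketch below flags a document when it shares any word-level n-gram with a benchmark item. The n-gram approach, the value `n=5`, and the function names are assumptions for illustration and may differ from what the MathPile authors actually did.

```python
import re


def ngrams(text, n=5):
    """Return the set of word-level n-grams in `text` (lowercased)."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(document, benchmark_items, n=5):
    """True if the document shares at least one n-gram with any benchmark item."""
    doc_grams = ngrams(document, n)
    return any(doc_grams & ngrams(item, n) for item in benchmark_items)


# Toy data, invented for this example (not taken from any real benchmark).
benchmark = ["What is the sum of the first ten positive integers?"]
leaked = "Homework: what is the sum of the first ten positive integers? Answer: 55."
clean = "The derivative of sin(x) is cos(x)."

leaked_flag = is_contaminated(leaked, benchmark)
clean_flag = is_contaminated(clean, benchmark)
```

A smaller `n` catches more paraphrased overlap but raises the false-positive rate on common mathematical phrasing, so the threshold is a trade-off rather than a fixed constant.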

Maintenance & Community

The project was accepted to the NeurIPS 2024 Datasets and Benchmarks (D&B) Track and has been featured on the Hugging Face Datasets trending list. The primary contributors are Zengzhi Wang, Xuefeng Li, Rui Xia, and Pengfei Liu.

Licensing & Compatibility

MathPile is licensed under CC BY-NC-SA 4.0, which prohibits commercial use and requires derivative works to be shared under the same terms. Source data that carries a more restrictive license remains subject to that license. A separate version is available for commercial use.

Limitations & Caveats

The creators acknowledge that their data collection and processing decisions may not be optimal and that some documents may be of lower quality; they state a commitment to ongoing refinement. Users are strongly urged not to use the corpus for activities that harm national or social security or that violate the law.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

  • 3 stars in the last 90 days
