ProX by GAIR-NLP

Data refinement framework for improving pre-training data quality

created 10 months ago
255 stars

Top 99.2% on sourcepulse

View on GitHub
Project Summary

ProX is a framework for refining pre-training data for large language models, treating data cleaning as a programming task rather than relying on manual rules. It targets researchers and practitioners aiming to improve LLM performance and efficiency, offering significant gains, especially in specialized domains like mathematics.

How It Works

ProX employs a two-level programming and execution approach (doc-level and chunk-level) to automatically clean and enhance data examples. This method allows even smaller models to refine data effectively, mimicking expert-level quality improvements at scale, which is more cost-efficient than LLM-based data synthesis.

Quick Start & Requirements

  • Installation: Clone the repository, create a Conda environment (prox, Python 3.10), and install requirements (pip install -r requirements.txt). FlashAttention and specific fused kernels may require manual compilation. Evaluation tools like lighteval and math-eval require separate Conda environments and installations.
  • Prerequisites: Python 3.10, Conda, Git. FlashAttention installation may require specific build steps.
  • Resources: Large-scale data refining requires a separate environment (refining, Python 3.10) and significant compute resources for processing >500B tokens.
  • Links: Codebase, Refining Models, LightEval, MathEval.
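The installation steps above can be sketched as the following environment-setup commands (the repository URL is assumed from the project name; check the GitHub page before use):

```shell
# Clone the repository (URL assumed; verify on the project page).
git clone https://github.com/GAIR-NLP/ProX.git
cd ProX

# Create and activate the main Conda environment described above.
conda create -n prox python=3.10 -y
conda activate prox
pip install -r requirements.txt

# FlashAttention may require a manual build; consult its install notes if this fails.
pip install flash-attn --no-build-isolation
```

Evaluation tooling (lighteval, math-eval) and large-scale refining each use their own Conda environments, per the notes above.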

Highlighted Details

  • Models trained on ProX-refined data show over 2% higher downstream performance than models trained on raw or rule-filtered data.
  • Achieves up to 20% accuracy boost in math tasks without manual adjustments.
  • Offers a >100B-token general-domain corpus and a ~5B-token math corpus, plus trained models (ProX, ProXMath).
  • The released DCLM-pro corpus (>500B tokens) yields a >1.5% gain within 50B training tokens.

Maintenance & Community

The project is associated with GAIR-NLP. Key dependencies include TinyLlama, FlashAttention, DataTrove, LightEval, and MathEval.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification.

Limitations & Caveats

The README marks some features as still under development (🚧). Specific hardware requirements for optimized performance (e.g., FlashAttention compilation) are not fully detailed for all use cases.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 12 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Jeff Hammerbacher (cofounder of Cloudera), and 10 more.

open-r1 by huggingface

SDK for reproducing DeepSeek-R1

Top 0.2% · 25k stars · created 6 months ago · updated 3 days ago