ProX by GAIR-NLP

Data refinement framework for improving pre-training data quality

created 10 months ago
255 stars

Top 99.2% on sourcepulse

View on GitHub
Project Summary

ProX is a framework for refining pre-training data for large language models, treating data cleaning as a programming task rather than relying on manual rules. It targets researchers and practitioners aiming to improve LLM performance and efficiency, offering significant gains, especially in specialized domains like mathematics.

How It Works

ProX employs a two-level programming and execution approach (doc-level and chunk-level) to automatically clean and enhance data examples. This method allows even smaller models to refine data effectively, mimicking expert-level quality improvements at scale, which is more cost-efficient than LLM-based data synthesis.

Quick Start & Requirements

  • Installation: Clone the repository, create a Conda environment (prox, Python 3.10), and install requirements (pip install -r requirements.txt). FlashAttention and specific fused kernels may require manual compilation. Evaluation tools like lighteval and math-eval require separate Conda environments and installations.
  • Prerequisites: Python 3.10, Conda, Git. FlashAttention installation may require specific build steps.
  • Resources: Large-scale data refining requires a separate environment (refining, Python 3.10) and significant compute resources for processing >500B tokens.
  • Links: Codebase, Refining Models, LightEval, MathEval.
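The installation steps above can be sketched as the following environment-setup commands (the repository URL is assumed from the project name; check the GitHub page before use):

```shell
# Clone the repository (URL assumed; verify on the project page).
git clone https://github.com/GAIR-NLP/ProX.git
cd ProX

# Create and activate the main Conda environment described above.
conda create -n prox python=3.10 -y
conda activate prox
pip install -r requirements.txt

# FlashAttention may require a manual build; consult its install notes if this fails.
pip install flash-attn --no-build-isolation
```

Evaluation tooling (lighteval, math-eval) and large-scale refining each use their own Conda environments, per the notes above.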

Highlighted Details

  • Models trained on ProX-refined data show over 2% higher downstream performance than models trained on raw or rule-filtered data.
  • Achieves up to 20% accuracy boost in math tasks without manual adjustments.
  • Offers a >100B-token general-domain corpus and a ~5B-token math corpus, plus trained models (ProX, ProXMath).
  • The released DCLM-pro corpus (>500B tokens) yields a >1.5% gain within 50B training tokens.

Maintenance & Community

The project is associated with GAIR-NLP. Key dependencies include TinyLlama, FlashAttention, DataTrove, LightEval, and MathEval.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification.

Limitations & Caveats

The README marks some features as still under development (🚧). Specific hardware requirements for optimized performance (e.g., FlashAttention compilation) are not fully detailed for all use cases.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 12 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Jeff Hammerbacher (cofounder of Cloudera), and 10 more.

open-r1 by huggingface

SDK for reproducing DeepSeek-R1

Top 0.2% · 25k stars · created 6 months ago · updated 3 days ago