Data refinement framework for improving pre-training data quality
ProX is a framework for refining pre-training data for large language models, treating data cleaning as a programming task rather than relying on manual rules. It targets researchers and practitioners aiming to improve LLM performance and efficiency, offering significant gains, especially in specialized domains like mathematics.
How It Works
ProX treats refinement as a two-level program generation and execution task: for each data example, a model emits a short program of document-level and chunk-level operations, which is then executed to clean and enhance the text. This allows even small models to mimic expert-level quality improvements at scale, which is more cost-efficient than LLM-based data synthesis.
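As a rough illustration of the idea, here is a minimal sketch of executing such model-generated refining programs. The operation names (`drop_doc`, `keep_doc`, `remove_lines`, `normalize`) follow the operations described in the ProX paper, but the parsing and interfaces below are assumptions for illustration, not the repository's actual API:

```python
import re

def execute_program(doc: str, program: str) -> str | None:
    """Apply a model-generated refining program to one document.

    Doc-level ops decide whether the document survives at all;
    chunk-level ops edit its lines. Returns the refined text,
    or None if the document is dropped.
    """
    lines = doc.split("\n")
    dropped = False
    removed = set()      # line indices scheduled for deletion
    replacements = []    # (source_str, target_str) pairs

    # The refining "program" is a short sequence of function calls
    # emitted by a small LM, one call per line.
    for call in program.strip().splitlines():
        call = call.strip()
        if call == "drop_doc()":
            dropped = True
        elif call == "keep_doc()":
            pass  # explicit no-op: document survives unchanged
        elif m := re.fullmatch(r"remove_lines\(start=(\d+), end=(\d+)\)", call):
            removed.update(range(int(m.group(1)), int(m.group(2)) + 1))
        elif m := re.fullmatch(r'normalize\("(.*)", "(.*)"\)', call):
            replacements.append((m.group(1), m.group(2)))

    if dropped:
        return None
    text = "\n".join(ln for i, ln in enumerate(lines) if i not in removed)
    for src, dst in replacements:
        text = text.replace(src, dst)
    return text


# Example: strip a boilerplate navigation line and normalize an HTML artifact.
doc = "Home | About | Login\nPythagoras: a^2 + b^2 = c^2\n&nbsp;trailing junk"
program = 'remove_lines(start=0, end=0)\nnormalize("&nbsp;", " ")'
print(execute_program(doc, program))
```

The design point is that the model only emits a few short function calls per document, while a cheap, deterministic executor performs the actual editing; this is what makes refinement with small models tractable at the hundreds-of-billions-of-tokens scale.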
Quick Start & Requirements
Setup requires a Conda environment (`prox`, Python 3.10) and installing the requirements (`pip install -r requirements.txt`). FlashAttention and specific fused kernels may require manual compilation. Evaluation tools like `lighteval` and `math-eval` require separate Conda environments and installations. Data refining uses its own environment (`refining`, Python 3.10) and significant compute resources for processing >500B tokens.
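A sketch of the setup flow implied above, assuming the standard Conda/pip workflow; the repository URL and the FlashAttention install command are assumptions here, and the project README remains authoritative:

```bash
# Assumed setup flow; adjust to the repository's actual instructions.
git clone https://github.com/GAIR-NLP/ProX.git
cd ProX

# Main environment.
conda create -n prox python=3.10 -y
conda activate prox
pip install -r requirements.txt

# FlashAttention may need manual compilation against your CUDA toolkit.
pip install flash-attn --no-build-isolation

# Evaluation tools and data refining live in separate environments.
conda create -n refining python=3.10 -y
```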
Maintenance & Community
The project is associated with GAIR-NLP. Key dependencies include TinyLlama, FlashAttention, DataTrove, LightEval, and MathEval.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification.
Limitations & Caveats
The README indicates ongoing development for some features (marked `[🚧] ...`). Specific hardware requirements for optimized performance (e.g., FlashAttention compilation) are not fully detailed for all use cases.