Language model pre-training toolkit
Top 90.9% on sourcepulse
MPNet offers a novel pre-training approach for language understanding tasks, aiming to improve upon BERT's masked language modeling and XLNet's permuted language modeling. It provides a unified implementation for various pre-training models and supports fine-tuning on common benchmarks like GLUE and SQuAD, targeting researchers and practitioners in NLP.
How It Works
MPNet pre-trains with a masked and permuted objective that combines the strengths of MLM and PLM: the input sequence is permuted and the tokens in the final portion of the permutation are predicted autoregressively (capturing dependencies among predicted tokens, as in PLM), while mask placeholders expose the position information of the full sentence to the encoder (as in MLM). This avoids MLM's independence assumption among masked tokens and PLM's missing full-sentence position information. The implementation is built on the fairseq codebase, allowing flexible configuration and training.
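The snippet below is a minimal, illustrative sketch of how a masked-and-permuted training example can be constructed; it is not the repository's implementation, and the helper name, mask symbol, and 15% prediction ratio are assumptions for illustration only.

```python
import random

MASK = "[MASK]"

def build_mpnet_example(tokens, predict_ratio=0.15, seed=0):
    """Sketch of MPNet-style input construction (illustrative, not the official code).

    1. Permute the token positions.
    2. The last `c` positions of the permutation become the prediction targets.
    3. The encoder input keeps the non-predicted tokens plus [MASK] placeholders at the
       predicted positions, so position information of the whole sentence is visible
       (as in MLM), while the targets are predicted autoregressively (as in PLM).
    """
    rng = random.Random(seed)
    n = len(tokens)
    c = max(1, int(n * predict_ratio))
    perm = list(range(n))
    rng.shuffle(perm)

    non_pred_pos, pred_pos = perm[: n - c], perm[n - c:]
    visible = [(p, tokens[p]) for p in non_pred_pos]   # kept tokens with their positions
    masked = [(p, MASK) for p in pred_pos]             # mask placeholders carry position info
    targets = [(p, tokens[p]) for p in pred_pos]       # tokens to predict, in permuted order
    return visible + masked, targets

inputs, targets = build_mpnet_example("the quick brown fox jumps over the lazy dog".split())
print(inputs)   # (position, token) pairs seen by the encoder
print(targets)  # (position, token) pairs to be predicted autoregressively
```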
Quick Start & Requirements
Pre-training uses the bundled fairseq fork: install it with `pip install --editable pretraining/`. Fine-tuning and downstream usage additionally require `pip install pytorch_transformers==1.0.0 transformers scipy sklearn`. Data preparation relies on the provided dictionary file (`dict.txt`) and tokenization scripts, and pre-trained checkpoints are loaded through `MPNet.from_pretrained`.
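A hedged loading sketch is shown below; the import path, checkpoint file name, and data directory are assumptions based on fairseq's hub-loading convention, and the repository README documents the authoritative invocation.

```python
import torch

# NOTE: the module path, checkpoint name, and data directory here are illustrative
# assumptions; consult the MPNet README for the exact import and file names.
from fairseq.models.masked_permutation_net import MPNet  # provided by the pretraining/ install

mpnet = MPNet.from_pretrained(
    "checkpoints",                    # directory containing the downloaded checkpoint
    checkpoint_file="mpnet.pt",       # hypothetical checkpoint file name
    data_name_or_path="data/mpnet",   # directory holding dict.txt from preprocessing
    bpe="bert",                       # MPNet uses BERT-style WordPiece tokenization
)
assert isinstance(mpnet.model, torch.nn.Module)
```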
Highlighted Details
- Combines masked (MLM) and permuted (PLM) language modeling in a single pre-training objective.
- Built on fairseq 0.8.0, with fine-tuning support for GLUE and SQuAD benchmarks.
- Developed in association with Microsoft Research.
Maintenance & Community
The project is associated with Microsoft Research. The codebase is based on fairseq version 0.8.0. No specific community channels (like Discord/Slack) or roadmap are mentioned in the README.
Licensing & Compatibility
The README does not explicitly state a license. However, the project's association with Microsoft and its use of fairseq (MIT License) suggest it may be available for research and potentially commercial use, but this requires explicit verification.
Limitations & Caveats
The project is based on an older version of fairseq (0.8.0), which might pose compatibility challenges with newer libraries or require significant effort to update. The pre-training process is computationally intensive and requires substantial data preparation.
Last updated 3 years ago; the repository is marked inactive.