MPNet by microsoft

Language model pre-training toolkit

created 5 years ago
294 stars

Top 90.9% on sourcepulse

Project Summary

MPNet offers a novel pre-training approach for language understanding tasks, aiming to improve upon BERT's masked language modeling (MLM) and XLNet's permuted language modeling (PLM). It provides a unified implementation for various pre-training models and supports fine-tuning on common benchmarks like GLUE and SQuAD, targeting researchers and practitioners in NLP.

How It Works

MPNet employs a masked and permuted pre-training strategy that combines the strengths of MLM and PLM. Like PLM, it predicts masked tokens autoregressively over a random permutation, capturing the dependency among predicted tokens that MLM ignores; like MLM, it feeds the model position information for the full sentence, avoiding the position discrepancy that PLM suffers from. The implementation is built upon the fairseq codebase, allowing for flexible configuration and training.
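The arrangement can be illustrated with a short, self-contained Python sketch. This is not code from the repository; it only shows, for a toy tokenized sentence and an assumed prediction ratio, how a permutation splits the tokens into a visible context part and a predicted part whose positions stay exposed to the model:

    import random

    def mpnet_style_arrangement(tokens, pred_ratio=0.15, seed=0):
        # Conceptual illustration only, not the repository's preprocessing code.
        # Permute the positions, keep the head of the permutation as visible
        # context (tokens with their original positions, as in MLM), and treat
        # the tail as the predicted part: mask placeholders that still carry the
        # target positions, to be predicted left-to-right as in PLM.
        rng = random.Random(seed)
        positions = list(range(len(tokens)))
        rng.shuffle(positions)

        n_pred = max(1, int(len(tokens) * pred_ratio))
        context_positions = positions[:-n_pred]
        predicted_positions = positions[-n_pred:]

        context = [(tokens[p], p) for p in context_positions]
        targets = [("[MASK]", p, tokens[p]) for p in predicted_positions]
        return context, targets

    ctx, tgt = mpnet_style_arrangement(["the", "quick", "brown", "fox", "jumps"], pred_ratio=0.4)
    print(ctx)  # visible (token, position) pairs
    print(tgt)  # (mask, position, gold token) triples predicted autoregressively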

Quick Start & Requirements

  • Installation: pip install --editable pretraining/ and pip install pytorch_transformers==1.0.0 transformers scipy sklearn.
  • Prerequisites: PyTorch, transformers library, SciPy, scikit-learn. Data preprocessing requires a BERT dictionary (dict.txt) and tokenization scripts.
  • Pre-training: Requires significant computational resources and a large corpus (e.g., WikiText-103). The provided command outlines parameters for training, including batch size, learning rate, and sequence length.
  • Fine-tuning: Pre-trained models can be loaded with MPNet.from_pretrained for downstream tasks (see the hedged loading sketch after this list).
  • Documentation: arXiv paper, "MPNet: Masked and Permuted Pre-training for Language Understanding".
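Loading a pre-trained checkpoint might look like the sketch below. It assumes the fairseq-style from_pretrained interface mentioned above; the import path, checkpoint name, and data directory are placeholders, not values verified against the repository:

    # Assumed fairseq-style loading; module path and arguments are placeholders.
    from fairseq.models.masked_permutation_net import MPNet  # import path not verified

    mpnet = MPNet.from_pretrained(
        "checkpoints",                # directory holding the pre-trained checkpoint
        "checkpoint_best.pt",         # checkpoint file name (placeholder)
        "path/to/preprocessed-data",  # fairseq data directory containing dict.txt
    )
    mpnet.eval()  # disable dropout for feature extraction or evaluation

For GLUE, SQuAD, or RACE fine-tuning, follow the repository's task-specific scripts and hyperparameters rather than this sketch.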

Highlighted Details

  • Unified implementation for BERT, XLNet, and MPNet.
  • Supports pre-training and fine-tuning for GLUE, SQuAD, and RACE tasks.
  • Offers options for relative position embeddings and whole-word masking.
  • Can be configured for MLM or PLM input modes.

Maintenance & Community

The project is associated with Microsoft Research. The codebase is based on fairseq version 0.8.0. No specific community channels (like Discord/Slack) or roadmap are mentioned in the README.

Licensing & Compatibility

The README does not explicitly state a license. However, the project's association with Microsoft and its use of fairseq (MIT License) suggest it may be available for research and potentially commercial use, but this requires explicit verification.

Limitations & Caveats

The project is based on an older version of fairseq (0.8.0), which might pose compatibility challenges with newer libraries or require significant effort to update. The pre-training process is computationally intensive and requires substantial data preparation.

Health Check

  • Last commit: 3 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 90 days

Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), Abhishek Thakur (World's First 4x Kaggle GrandMaster), and 5 more.

Explore Similar Projects

xlnet by zihangdai

Language model research paper using generalized autoregressive pretraining

created 6 years ago
updated 2 years ago
6k stars