MPNet by microsoft

Language model pre-training toolkit

created 5 years ago
294 stars

Top 90.9% on sourcepulse

Project Summary

MPNet offers a novel pre-training approach for language understanding tasks, aiming to improve upon BERT's masked language modeling (MLM) and XLNet's permuted language modeling (PLM). It provides a unified implementation for various pre-training models and supports fine-tuning on common benchmarks like GLUE and SQuAD, targeting researchers and practitioners in NLP.

How It Works

MPNet employs a masked and permuted pre-training strategy that combines the strengths of MLM and PLM. Like PLM, it predicts masked tokens autoregressively over a random permutation, capturing the dependency among predicted tokens that MLM ignores; like MLM, it feeds the model position information for the full sentence, avoiding the position discrepancy that PLM suffers from. The implementation is built upon the fairseq codebase, allowing for flexible configuration and training.
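The arrangement can be illustrated with a short, self-contained Python sketch. This is not code from the repository; it only shows, for a toy tokenized sentence and an assumed prediction ratio, how a permutation splits the tokens into a visible context part and a predicted part whose positions stay exposed to the model:

    import random

    def mpnet_style_arrangement(tokens, pred_ratio=0.15, seed=0):
        # Conceptual illustration only, not the repository's preprocessing code.
        # Permute the positions, keep the head of the permutation as visible
        # context (tokens with their original positions, as in MLM), and treat
        # the tail as the predicted part: mask placeholders that still carry the
        # target positions, to be predicted left-to-right as in PLM.
        rng = random.Random(seed)
        positions = list(range(len(tokens)))
        rng.shuffle(positions)

        n_pred = max(1, int(len(tokens) * pred_ratio))
        context_positions = positions[:-n_pred]
        predicted_positions = positions[-n_pred:]

        context = [(tokens[p], p) for p in context_positions]
        targets = [("[MASK]", p, tokens[p]) for p in predicted_positions]
        return context, targets

    ctx, tgt = mpnet_style_arrangement(["the", "quick", "brown", "fox", "jumps"], pred_ratio=0.4)
    print(ctx)  # visible (token, position) pairs
    print(tgt)  # (mask, position, gold token) triples predicted autoregressively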

Quick Start & Requirements

  • Installation: pip install --editable pretraining/ and pip install pytorch_transformers==1.0.0 transformers scipy sklearn.
  • Prerequisites: PyTorch, transformers library, SciPy, scikit-learn. Data preprocessing requires a BERT dictionary (dict.txt) and tokenization scripts.
  • Pre-training: Requires significant computational resources and a large corpus (e.g., WikiText-103). The provided command outlines parameters for training, including batch size, learning rate, and sequence length.
  • Fine-tuning: Pre-trained models can be loaded with MPNet.from_pretrained for downstream tasks (see the hedged loading sketch after this list).
  • Documentation: arXiv paper, "MPNet: Masked and Permuted Pre-training for Language Understanding".
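Loading a pre-trained checkpoint might look like the sketch below. It assumes the fairseq-style from_pretrained interface mentioned above; the import path, checkpoint name, and data directory are placeholders, not values verified against the repository:

    # Assumed fairseq-style loading; module path and arguments are placeholders.
    from fairseq.models.masked_permutation_net import MPNet  # import path not verified

    mpnet = MPNet.from_pretrained(
        "checkpoints",                # directory holding the pre-trained checkpoint
        "checkpoint_best.pt",         # checkpoint file name (placeholder)
        "path/to/preprocessed-data",  # fairseq data directory containing dict.txt
    )
    mpnet.eval()  # disable dropout for feature extraction or evaluation

For GLUE, SQuAD, or RACE fine-tuning, follow the repository's task-specific scripts and hyperparameters rather than this sketch.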

Highlighted Details

  • Unified implementation for BERT, XLNet, and MPNet.
  • Supports pre-training and fine-tuning for GLUE, SQuAD, and RACE tasks.
  • Offers options for relative position embeddings and whole-word masking.
  • Can be configured for MLM or PLM input modes.

Maintenance & Community

The project is associated with Microsoft Research. The codebase is based on fairseq version 0.8.0. No specific community channels (like Discord/Slack) or roadmap are mentioned in the README.

Licensing & Compatibility

The README does not explicitly state a license. However, the project's association with Microsoft and its use of fairseq (MIT License) suggest it may be available for research and potentially commercial use, but this requires explicit verification.

Limitations & Caveats

The project is based on an older version of fairseq (0.8.0), which might pose compatibility challenges with newer libraries or require significant effort to update. The pre-training process is computationally intensive and requires substantial data preparation.

Health Check

  • Last commit: 3 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 90 days

Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), Abhishek Thakur (World's First 4x Kaggle GrandMaster), and 5 more.

Explore Similar Projects

xlnet by zihangdai

Language model research paper using generalized autoregressive pretraining

created 6 years ago
updated 2 years ago
6k stars