Pre-training method for sequence-to-sequence language generation tasks
MASS is a pre-training method for sequence-to-sequence language generation tasks, targeting researchers and practitioners in natural language processing. It enhances performance on tasks like Neural Machine Translation (NMT) and text summarization by masking sequence fragments for encoder-decoder prediction.
How It Works
MASS employs a masked sequence-to-sequence objective. A contiguous span of tokens in the input sentence is replaced with mask symbols on the encoder side. The decoder then predicts that span autoregressively, attending to the encoder's output; tokens outside the span are masked on the decoder side, so the decoder must rely on the encoder rather than copying the surrounding context. This forces the model to learn dependencies between the unmasked context and the masked span, jointly training the encoder and decoder and improving generation quality.
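As an illustration only (the repository's actual implementation is built on fairseq), the sketch below constructs one MASS-style training example; the mask symbol, helper name, and 50% span ratio are assumptions made for clarity.

```python
# Minimal sketch of the MASS masking scheme (illustrative, not the repo's code).
import random

MASK = "[MASK]"

def mass_example(tokens, span_ratio=0.5):
    """Build (encoder_input, decoder_input, decoder_target) for one sentence."""
    span_len = max(1, int(len(tokens) * span_ratio))
    start = random.randint(0, len(tokens) - span_len)
    end = start + span_len

    fragment = tokens[start:end]

    # Encoder sees the sentence with the contiguous fragment masked out.
    encoder_input = tokens[:start] + [MASK] * span_len + tokens[end:]

    # Decoder predicts the fragment; its input is the fragment shifted right
    # by one position (tokens outside the fragment are masked on the decoder side).
    decoder_input = [MASK] + fragment[:-1]
    decoder_target = fragment

    return encoder_input, decoder_input, decoder_target

tokens = "the quick brown fox jumps over the lazy dog".split()
enc, dec_in, dec_out = mass_example(tokens)
print(enc)      # sentence with a contiguous masked span
print(dec_in)   # shifted fragment, starting with the mask symbol
print(dec_out)  # the masked fragment the decoder must predict
```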
Quick Start & Requirements
Requires fairseq (version 0.7.1 for unsupervised NMT, 0.8.0 for summarization) and PyTorch (0.4 or 1.0). fastBPE and Moses are needed for tokenization; Apex is recommended for FP16 training. Data is binarized with fairseq-preprocess. Links to example data download scripts and pre-trained models are provided.
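For orientation, here is a minimal sketch of the binarization step with fairseq-preprocess; the paths, language pair, and worker count are placeholders rather than the repository's exact commands, so consult the provided scripts for the real arguments.

```python
# Hypothetical sketch: binarize BPE-tokenized parallel data with
# fairseq-preprocess. Paths and language codes are placeholders.
import subprocess

cmd = [
    "fairseq-preprocess",
    "--source-lang", "en",
    "--target-lang", "fr",
    "--trainpref", "data/train.bpe",   # expects data/train.bpe.en and .fr
    "--validpref", "data/valid.bpe",
    "--testpref", "data/test.bpe",
    "--destdir", "data-bin/en-fr",
    "--workers", "8",
]
subprocess.run(cmd, check=True)  # writes binarized data and dictionaries to --destdir
```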
Highlighted Details
Maintenance & Community
The project is from Microsoft Research. Related work includes MPNet. Links to GitHub repositories for MASS and MPNet are provided.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The project requires specific, older versions of PyTorch and Fairseq, which may pose compatibility challenges with current environments. The setup and data preparation steps are detailed but complex, requiring careful execution.
The repository is inactive; the last update was roughly two years ago.
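Given the version pinning noted above, a small sanity check like the following can confirm that an environment matches the expected legacy releases; the pins are taken from the requirements listed earlier, and the script itself is only an illustrative sketch.

```python
# Illustrative sketch: check installed package versions against the legacy
# releases MASS expects (pins taken from the Quick Start section above).
import pkg_resources

EXPECTED = {"torch": ("0.4", "1.0"), "fairseq": ("0.7.1", "0.8.0")}

for pkg, allowed in EXPECTED.items():
    try:
        installed = pkg_resources.get_distribution(pkg).version
    except pkg_resources.DistributionNotFound:
        print("%s: not installed (expected one of %s)" % (pkg, (allowed,)))
        continue
    ok = any(installed.startswith(prefix) for prefix in allowed)
    status = "OK" if ok else "expected one of %s" % (allowed,)
    print("%s %s: %s" % (pkg, installed, status))
```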