MASS by microsoft

Pre-training method for sequence-to-sequence language generation tasks

Created 6 years ago
1,120 stars

Top 34.3% on SourcePulse

View on GitHub
Project Summary

MASS is a pre-training method for sequence-to-sequence language generation tasks, targeting researchers and practitioners in natural language processing. It improves performance on tasks such as Neural Machine Translation (NMT) and text summarization by jointly pre-training the encoder and decoder to reconstruct masked sentence fragments.

How It Works

MASS employs a masked sequence-to-sequence objective. A contiguous span of tokens in the input sequence is masked before it is fed to the encoder. The decoder is then trained to predict the masked span token by token, conditioned on the encoder's representation of the partially masked sentence and on the span tokens it has already generated; the decoder's remaining input positions are masked so it cannot simply copy from the target side. This forces the encoder to understand the unmasked context and the decoder to rely on it, improving generation quality.
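
To make the objective concrete, here is a minimal Python sketch of the masking step, assuming whitespace-tokenized input, a [MASK] placeholder symbol, and a 50% span length; it is a simplification for illustration, not the repository's implementation, which operates on binarized fairseq batches.

    # Minimal sketch of MASS-style span masking for one sentence (illustrative only).
    import random

    MASK = "[MASK]"  # placeholder symbol; the real vocabulary uses a dedicated mask token

    def mass_mask(tokens, mask_ratio=0.5, seed=None):
        """Return (encoder_input, decoder_input, decoder_target) for one sentence."""
        rng = random.Random(seed)
        m = len(tokens)
        span_len = max(1, int(m * mask_ratio))   # length of the contiguous masked fragment
        start = rng.randint(0, m - span_len)     # random start position of the fragment
        end = start + span_len

        # The encoder sees the sentence with the fragment replaced by mask symbols.
        encoder_input = tokens[:start] + [MASK] * span_len + tokens[end:]

        # The decoder predicts the fragment; its input is the fragment shifted right
        # by one position (teacher forcing), so it only sees the rest of the sentence
        # through the encoder's output.
        decoder_target = tokens[start:end]
        decoder_input = [MASK] + decoder_target[:-1]

        return encoder_input, decoder_input, decoder_target

    # Example: mask roughly half of a seven-token sentence.
    enc, dec_in, dec_out = mass_mask("we choose to go to the moon".split(), seed=0)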

Quick Start & Requirements

  • Installation: Requires fairseq (version 0.7.1 for unsupervised NMT, 0.8.0 for summarization) and PyTorch (0.4 or 1.0). fastBPE and Moses are needed for tokenization. Apex is recommended for FP16 training.
  • Data: Requires specific data preparation steps, including tokenization and binarization with fairseq-preprocess (a generic sketch follows this list). Links to example data download scripts and pre-trained models are provided.
  • Resources: Training and fine-tuning involve significant computational resources, including multiple GPUs, as indicated by distributed training examples.
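
As a rough illustration of the binarization step, the snippet below wraps a generic fairseq-preprocess invocation in Python. The language pair, file paths, and worker count are hypothetical placeholders; the repository's own scripts define the exact flags and data layout that MASS's tasks expect.

    # Generic sketch of binarizing a BPE-tokenized parallel corpus with fairseq-preprocess.
    # All paths and the language pair are placeholders; follow the MASS scripts for the
    # real layout and any task-specific options.
    import subprocess

    subprocess.run(
        [
            "fairseq-preprocess",
            "--source-lang", "en", "--target-lang", "fr",  # example language pair
            "--trainpref", "data/train.bpe",   # expects data/train.bpe.en and data/train.bpe.fr
            "--validpref", "data/valid.bpe",
            "--testpref", "data/test.bpe",
            "--destdir", "data-bin/en-fr",     # output directory for the binarized dataset
            "--joined-dictionary",             # share one vocabulary across both languages
            "--workers", "8",
        ],
        check=True,
    )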

Highlighted Details

  • Supports unsupervised NMT, supervised NMT, and text summarization.
  • Provides pre-trained models for various language pairs (e.g., En-Fr, En-De, En-Ro, and Zh-En).
  • Achieves competitive results, e.g., 39.1 BLEU on Ro-En NMT with back-translation.
  • Builds on fairseq, so it integrates with fairseq's preprocessing, training, and generation tooling.

Maintenance & Community

The project comes from Microsoft Research. The README points to MPNet as related follow-up work and links to the GitHub repositories for both MASS and MPNet.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project requires specific, older versions of PyTorch and Fairseq, which may pose compatibility challenges with current environments. The setup and data preparation steps are detailed but complex, requiring careful execution.

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 30 days

Starred by Lukas Biewald (Cofounder of Weights & Biases), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 2 more.

Explore Similar Projects

DialoGPT by microsoft

Response generation model via large-scale pretraining
2k stars · 0.1% · Created 6 years ago · Updated 2 years ago
Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Lysandre Debut (Chief Open-Source Officer at Hugging Face), and 17 more.

pytext by facebookresearch

NLP framework (deprecated, migrate to torchtext)
6k stars · 0% · Created 7 years ago · Updated 2 years ago