MASS by microsoft

Pre-training method for sequence-to-sequence language generation tasks

Created 6 years ago
1,120 stars

Top 34.3% on SourcePulse

View on GitHub
Project Summary

MASS is a pre-training method for sequence-to-sequence language generation tasks, targeting researchers and practitioners in natural language processing. It improves performance on tasks such as Neural Machine Translation (NMT) and text summarization by jointly pre-training the encoder and decoder to reconstruct masked sentence fragments.

How It Works

MASS employs a masked sequence-to-sequence objective. A contiguous span of tokens in the input sequence is masked before it is fed to the encoder. The decoder is then trained to predict the masked span token by token, conditioned on the encoder's representation of the partially masked sentence and on the span tokens it has already generated; the decoder's remaining input positions are masked so it cannot simply copy from the target side. This forces the encoder to understand the unmasked context and the decoder to rely on it, improving generation quality.
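
To make the objective concrete, here is a minimal Python sketch of the masking step, assuming whitespace-tokenized input, a [MASK] placeholder symbol, and a 50% span length; it is a simplification for illustration, not the repository's implementation, which operates on binarized fairseq batches.

    # Minimal sketch of MASS-style span masking for one sentence (illustrative only).
    import random

    MASK = "[MASK]"  # placeholder symbol; the real vocabulary uses a dedicated mask token

    def mass_mask(tokens, mask_ratio=0.5, seed=None):
        """Return (encoder_input, decoder_input, decoder_target) for one sentence."""
        rng = random.Random(seed)
        m = len(tokens)
        span_len = max(1, int(m * mask_ratio))   # length of the contiguous masked fragment
        start = rng.randint(0, m - span_len)     # random start position of the fragment
        end = start + span_len

        # The encoder sees the sentence with the fragment replaced by mask symbols.
        encoder_input = tokens[:start] + [MASK] * span_len + tokens[end:]

        # The decoder predicts the fragment; its input is the fragment shifted right
        # by one position (teacher forcing), so it only sees the rest of the sentence
        # through the encoder's output.
        decoder_target = tokens[start:end]
        decoder_input = [MASK] + decoder_target[:-1]

        return encoder_input, decoder_input, decoder_target

    # Example: mask roughly half of a seven-token sentence.
    enc, dec_in, dec_out = mass_mask("we choose to go to the moon".split(), seed=0)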

Quick Start & Requirements

  • Installation: Requires fairseq (version 0.7.1 for unsupervised NMT, 0.8.0 for summarization) and PyTorch (0.4 or 1.0). fastBPE and Moses are needed for tokenization. Apex is recommended for FP16 training.
  • Data: Requires specific data preparation steps, including tokenization and binarization with fairseq-preprocess (a generic sketch follows this list). Links to example data download scripts and pre-trained models are provided.
  • Resources: Training and fine-tuning involve significant computational resources, including multiple GPUs, as indicated by distributed training examples.
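
As a rough illustration of the binarization step, the snippet below wraps a generic fairseq-preprocess invocation in Python. The language pair, file paths, and worker count are hypothetical placeholders; the repository's own scripts define the exact flags and data layout that MASS's tasks expect.

    # Generic sketch of binarizing a BPE-tokenized parallel corpus with fairseq-preprocess.
    # All paths and the language pair are placeholders; follow the MASS scripts for the
    # real layout and any task-specific options.
    import subprocess

    subprocess.run(
        [
            "fairseq-preprocess",
            "--source-lang", "en", "--target-lang", "fr",  # example language pair
            "--trainpref", "data/train.bpe",   # expects data/train.bpe.en and data/train.bpe.fr
            "--validpref", "data/valid.bpe",
            "--testpref", "data/test.bpe",
            "--destdir", "data-bin/en-fr",     # output directory for the binarized dataset
            "--joined-dictionary",             # share one vocabulary across both languages
            "--workers", "8",
        ],
        check=True,
    )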

Highlighted Details

  • Supports unsupervised NMT, supervised NMT, and text summarization.
  • Provides pre-trained models for various language pairs (e.g., En-Fr, En-De, En-Ro, and Zh-En).
  • Achieves competitive results, e.g., 39.1 BLEU on Ro-En NMT with back-translation.
  • Builds on fairseq, so it integrates with fairseq's preprocessing, training, and generation tooling.

Maintenance & Community

The project comes from Microsoft Research. The README points to MPNet as related follow-up work and links to the GitHub repositories for both MASS and MPNet.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project requires specific, older versions of PyTorch and Fairseq, which may pose compatibility challenges with current environments. The setup and data preparation steps are detailed but complex, requiring careful execution.

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 30 days

Starred by Lukas Biewald (Cofounder of Weights & Biases), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 2 more.

Explore Similar Projects

DialoGPT by microsoft

Response generation model via large-scale pretraining
2k stars · 0.1% · Created 6 years ago · Updated 2 years ago
Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Lysandre Debut (Chief Open-Source Officer at Hugging Face), and 17 more.

pytext by facebookresearch

NLP framework (deprecated, migrate to torchtext)
6k stars · 0% · Created 7 years ago · Updated 2 years ago