mlm-scoring by awslabs

Library for masked language model scoring (ACL 2020 paper)

Created 5 years ago

347 stars

Top 80.0% on SourcePulse

Project Summary

This Python library and accompanying examples enable scoring sentences and rescoring n-best lists using Masked Language Models (MLMs) like BERT and RoBERTa, as well as autoregressive models like GPT-2. It targets researchers and practitioners in speech recognition, machine translation, and linguistic acceptability, offering improved language model integration for these tasks.

How It Works

The library computes pseudo-log-likelihood (PLL) scores by masking each token of a sentence in turn and summing the log-probabilities the MLM assigns to the original tokens. For autoregressive models such as GPT-2, it computes ordinary log-probability scores instead. Either score can be used to rank or rescore hypotheses without task-specific training, which makes it straightforward to plug pre-trained language models into existing NLP pipelines.
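The masking procedure described above can be sketched in a few lines of Python. This is a minimal, model-agnostic illustration and not the library's own API: the function name pseudo_log_likelihood, the MaskedLogProb callable signature, and the toy unigram scorer are all assumptions made for the example. A real setup would replace the toy scorer with a call into a masked language model such as BERT.

```python
import math
from typing import Callable, Sequence

MASK = "[MASK]"

# Hypothetical scorer signature (an assumption for this sketch):
# (masked tokens, masked position, original token) -> log P(original | context)
MaskedLogProb = Callable[[Sequence[str], int, str], float]


def pseudo_log_likelihood(tokens: Sequence[str],
                          masked_log_prob: MaskedLogProb) -> float:
    """Sum the log-probability of each token given all the other tokens."""
    total = 0.0
    for position, target in enumerate(tokens):
        # Mask exactly one position, leaving the rest of the sentence visible.
        masked = list(tokens)
        masked[position] = MASK
        total += masked_log_prob(masked, position, target)
    return total


def toy_masked_log_prob(masked: Sequence[str], position: int, target: str) -> float:
    """Toy stand-in for an MLM: fixed unigram probabilities, context ignored."""
    assert masked[position] == MASK
    unigram = {"the": 0.5, "cat": 0.25, "sat": 0.25}
    return math.log(unigram[target])


score = pseudo_log_likelihood(["the", "cat", "sat"], toy_masked_log_prob)
print(score)
```

Because each position needs its own masked copy of the sentence, a sentence of n tokens costs n forward passes of the model (usually batched), unlike an autoregressive model, which scores all tokens in a single pass.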

Quick Start & Requirements

Install from a local clone of the repository with pip: pip install -e .
Dependencies: Python 3.6+, PyTorch, MXNet (with CUDA support recommended, e.g., mxnet-cu102mkl).
Usage examples and detailed documentation are available in the repository.

Highlighted Details

Supports scoring with BERT, RoBERTa, XLM, ALBERT, DistilBERT (PLL), and GPT-2 (log-probability).
Includes functionality for "maskless" PLL scoring and rescoring n-best lists via log-linear interpolation.
Demonstrates use cases in Speech Recognition (ESPnet LAS), Machine Translation (Transformer NMT), and Linguistic Acceptability (BLiMP).
Reports a word error rate (WER) reduction from 12.2% to 8.5% (roughly 30% relative) in a LibriSpeech ASR rescoring example.
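The log-linear interpolation used for n-best rescoring can be illustrated as follows. This is a sketch under stated assumptions, not the library's interface: the function name rescore_nbest, the tuple layout, and every score value are hypothetical and chosen only to show how the interpolation weight changes the ranking. Each hypothesis carries a base score from the original system (for example, an ASR or NMT model's log-probability) and a language model score such as a PLL.

```python
from typing import List, Tuple

# (hypothesis text, base model log-score, language model score such as a PLL)
Hypothesis = Tuple[str, float, float]


def rescore_nbest(hypotheses: List[Hypothesis],
                  weight: float) -> List[Tuple[str, float]]:
    """Rank hypotheses by base_score + weight * lm_score, best first."""
    combined = [(text, base + weight * lm) for text, base, lm in hypotheses]
    return sorted(combined, key=lambda item: item[1], reverse=True)


# Illustrative values only; these are not results from the project.
nbest = [
    ("a cat sad", -3.5, -16.0),  # preferred by the base model alone
    ("a cat sat", -4.0, -10.0),  # preferred by the language model
]

print(rescore_nbest(nbest, weight=0.0)[0][0])  # base model ranking only
print(rescore_nbest(nbest, weight=0.5)[0][0])  # interpolated ranking
```

The weight is typically tuned on a held-out development set; a weight of zero reproduces the original system's ranking.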

Maintenance & Community

The project originates from AWS Labs. Further community engagement details (e.g., Discord, Slack, roadmap) are not explicitly mentioned in the README.

Licensing & Compatibility

The project is released under the Apache License 2.0, which permits commercial use and integration with closed-source projects.

Limitations & Caveats

The PyTorch interface is marked as experimental. Installation depends on an MXNet build that matches the local CUDA version (for example, mxnet-cu102mkl for CUDA 10.2), so the environment may need careful management.
