mlm-scoring by awslabs

Library for masked language model scoring (ACL 2020 paper)

created 5 years ago
344 stars

Top 81.6% on sourcepulse

Project Summary

This Python library and accompanying examples enable scoring sentences and rescoring n-best lists using Masked Language Models (MLMs) like BERT and RoBERTa, as well as autoregressive models like GPT-2. It targets researchers and practitioners in speech recognition, machine translation, and linguistic acceptability, offering improved language model integration for these tasks.

How It Works

The library computes pseudo-log-likelihood (PLL) scores by masking each token in a sentence one at a time and summing the log-probabilities the MLM assigns to the original tokens. For autoregressive models it computes conventional log-probability scores instead. Both kinds of score allow unsupervised ranking and rescoring of hypotheses, so pre-trained language models can be integrated into existing NLP pipelines.
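The PLL computation described above can be illustrated with a minimal sketch. It is not the library's API: the `MaskedLM` callable, the `MASK` symbol, and `toy_masked_lm` are hypothetical stand-ins introduced here so the example runs without any model download. A real MLM such as BERT would take the place of the toy function.

```python
import math
from typing import Callable, Dict, List, Sequence

MASK = "[MASK]"  # placeholder mask symbol (an assumption for this sketch)

# Hypothetical model interface: given a token sequence with one position
# masked, return a probability for each candidate token at that position.
MaskedLM = Callable[[Sequence[str], int], Dict[str, float]]


def pseudo_log_likelihood(tokens: Sequence[str], masked_lm: MaskedLM) -> float:
    """Sum of log P(token_i | all other tokens), masking one position at a time."""
    total = 0.0
    for i, target in enumerate(tokens):
        masked: List[str] = list(tokens)
        masked[i] = MASK  # hide only the token being scored
        distribution = masked_lm(masked, i)
        total += math.log(distribution[target])
    return total


def toy_masked_lm(masked_tokens: Sequence[str], position: int) -> Dict[str, float]:
    """Toy stand-in for a real MLM: uniform over a four-word vocabulary."""
    vocabulary = ["the", "cat", "sat", "down"]
    return {word: 1.0 / len(vocabulary) for word in vocabulary}


pll = pseudo_log_likelihood(["the", "cat", "sat"], toy_masked_lm)
print(pll)
```

With the uniform toy model every token contributes log(1/4), so the score for a three-token sentence is three times that value. A higher (less negative) PLL indicates a sentence the model finds more plausible.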

Quick Start & Requirements

  • Install via pip from a local clone of the repository (editable install): pip install -e .
  • Dependencies: Python 3.6+, PyTorch, MXNet (with CUDA support recommended, e.g., mxnet-cu102mkl).
  • Usage examples and detailed documentation are available in the repository.

Highlighted Details

  • Supports scoring with BERT, RoBERTa, XLM, ALBERT, DistilBERT (PLL), and GPT-2 (log-probability).
  • Includes functionality for "maskless" PLL scoring and rescoring n-best lists via log-linear interpolation.
  • Demonstrates use cases in Speech Recognition (ESPnet LAS), Machine Translation (Transformer NMT), and Linguistic Acceptability (BLiMP).
  • Achieved a WER reduction from 12.2% to 8.5% in a LibriSpeech ASR rescoring example.
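The n-best rescoring by log-linear interpolation mentioned in the list above can be sketched as follows. This is an illustration under assumptions, not the library's implementation: the `Hypothesis` type, the `rescore_nbest` function, and all scores are hypothetical, and the combined score is assumed to be the upstream model's log-score plus a weighted language model score.

```python
from typing import List, NamedTuple


class Hypothesis(NamedTuple):
    text: str
    base_score: float  # log-score from the upstream model (e.g. ASR or NMT)
    lm_score: float    # language model score, e.g. a PLL


def rescore_nbest(nbest: List[Hypothesis], lm_weight: float) -> List[Hypothesis]:
    """Rank hypotheses by base_score + lm_weight * lm_score, best first."""
    return sorted(
        nbest,
        key=lambda h: h.base_score + lm_weight * h.lm_score,
        reverse=True,
    )


# Made-up scores for illustration only.
nbest = [
    Hypothesis("recognize speech", base_score=-1.5, lm_score=-4.0),
    Hypothesis("wreck a nice beach", base_score=-1.0, lm_score=-10.0),
]

without_lm = rescore_nbest(nbest, lm_weight=0.0)
with_lm = rescore_nbest(nbest, lm_weight=0.5)
print(without_lm[0].text)
print(with_lm[0].text)
```

With a weight of zero the ranking follows the upstream model alone. With a positive weight the hypothesis the language model prefers can move to the top. In practice the weight would be tuned on a development set.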

Maintenance & Community

The project originates from AWS Labs. Further community engagement details (e.g., Discord, Slack, roadmap) are not explicitly mentioned in the README.

Licensing & Compatibility

The project is released under the Apache License 2.0, which permits commercial use and integration with closed-source projects.

Limitations & Caveats

The PyTorch interface is marked as experimental. The MXNet package must match the installed CUDA version (e.g., mxnet-cu102mkl for CUDA 10.2), so the environment needs to be managed carefully.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 90 days
