mlm-scoring  by awslabs

Library for masked language model scoring (ACL 2020 paper)

Created 5 years ago
346 stars

Project Summary

This Python library, with accompanying examples, scores sentences and rescores n-best lists using masked language models (MLMs) such as BERT and RoBERTa, as well as autoregressive models such as GPT-2. It is aimed at researchers and practitioners working on speech recognition, machine translation, and linguistic acceptability.

How It Works

The library computes a pseudo-log-likelihood (PLL) score for a sentence by masking each token in turn and summing the log-probabilities the MLM assigns to the original tokens. For autoregressive models it computes the ordinary log-probability of the sentence instead. Either score supports unsupervised ranking and rescoring of hypotheses, so pre-trained language models can be plugged into existing NLP pipelines.
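The PLL computation can be sketched in a few lines of plain Python. This is an illustration only, not the library's code: `pseudo_log_likelihood`, the `MaskedPredictor` interface, and `toy_predictor` are hypothetical names, and the toy predictor is a hand-written lookup table standing in for a real MLM such as BERT.

```python
import math
from typing import Callable, Dict, Sequence

MASK = "[MASK]"

# Hypothetical interface (an assumption, not the library's API): given a token
# sequence with one position masked, return a probability distribution over
# the vocabulary for that position.
MaskedPredictor = Callable[[Sequence[str], int], Dict[str, float]]


def pseudo_log_likelihood(tokens: Sequence[str], predict: MaskedPredictor) -> float:
    """Sum log P(token_i | all other tokens), masking one position at a time."""
    total = 0.0
    for i, target in enumerate(tokens):
        masked = list(tokens)
        masked[i] = MASK  # hide only the token being scored
        distribution = predict(masked, i)
        total += math.log(distribution[target])
    return total


def toy_predictor(masked: Sequence[str], position: int) -> Dict[str, float]:
    """Toy stand-in for an MLM. It looks only at the left neighbour, whereas
    a real MLM conditions on the context on both sides of the mask."""
    left = masked[position - 1] if position > 0 else "<s>"
    table = {
        "<s>": {"the": 0.8, "cat": 0.1, "sat": 0.1},
        "the": {"the": 0.1, "cat": 0.8, "sat": 0.1},
        "cat": {"the": 0.1, "cat": 0.1, "sat": 0.8},
        "sat": {"the": 0.4, "cat": 0.3, "sat": 0.3},
    }
    return table[left]


good = pseudo_log_likelihood(["the", "cat", "sat"], toy_predictor)
bad = pseudo_log_likelihood(["sat", "cat", "the"], toy_predictor)
print(good > bad)  # compares the scores of the two word orders
```

A sentence of n tokens needs n forward passes through the model, one per masked position, which is the main cost of PLL scoring compared with a single pass through an autoregressive model.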

Quick Start & Requirements

  • Install from a clone of the repository, running this in the repository root: pip install -e .
  • Dependencies: Python 3.6+, PyTorch, and MXNet (a CUDA build is recommended, e.g., mxnet-cu102mkl).
  • Usage examples and detailed documentation are available in the repository.

Highlighted Details

  • Supports PLL scoring with BERT, RoBERTa, XLM, ALBERT, and DistilBERT, and log-probability scoring with GPT-2.
  • Includes functionality for "maskless" PLL scoring and rescoring n-best lists via log-linear interpolation.
  • Demonstrates use cases in Speech Recognition (ESPnet LAS), Machine Translation (Transformer NMT), and Linguistic Acceptability (BLiMP).
  • Reports a word error rate (WER) reduction from 12.2% to 8.5% in a LibriSpeech ASR rescoring example.
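Rescoring an n-best list by log-linear interpolation amounts to adding a weighted language model score to each hypothesis's original score and re-sorting. The sketch below is a minimal illustration under stated assumptions: `Hypothesis`, `rescore`, the `weight` parameter, and all scores are made up for the example and are not taken from the library or the paper.

```python
from typing import List, NamedTuple


class Hypothesis(NamedTuple):
    text: str
    model_score: float  # log-score from the ASR or NMT system (illustrative)
    lm_score: float     # PLL or log-probability from the language model (illustrative)


def rescore(nbest: List[Hypothesis], weight: float) -> List[Hypothesis]:
    """Sort hypotheses by model_score + weight * lm_score, best first."""
    return sorted(
        nbest,
        key=lambda h: h.model_score + weight * h.lm_score,
        reverse=True,
    )


# Made-up 3-best list: the system slightly prefers the hypothesis with "four".
nbest = [
    Hypothesis("i scream for ice cream", model_score=-4.0, lm_score=-20.0),
    Hypothesis("ice cream for ice cream", model_score=-4.5, lm_score=-35.0),
    Hypothesis("i scream four ice cream", model_score=-3.8, lm_score=-30.0),
]

before = rescore(nbest, weight=0.0)[0]  # weight 0 keeps the original ranking
after = rescore(nbest, weight=0.1)[0]   # the language model score now contributes
print(before.text)
print(after.text)
```

The interpolation weight is normally tuned on a development set; a weight of 0 reproduces the original ranking, which makes a convenient sanity check.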

Maintenance & Community

The project originates from AWS Labs. The README does not mention community channels (e.g., Discord, Slack) or a roadmap.

Licensing & Compatibility

The project is released under the Apache License 2.0, which permits commercial use and integration with closed-source projects.

Limitations & Caveats

The PyTorch interface is marked as experimental. Installation requires an MXNet build that matches the installed CUDA version, so the environment may need careful management.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 3 stars in the last 30 days
