NLG metric based on transfer learning
Top 47.5% on sourcepulse
BLEURT is a Python library and command-line tool for evaluating Natural Language Generation (NLG) outputs. It provides a learned metric, based on BERT and RemBERT, that scores candidate sentences against references and aims to capture both fluency and meaning preservation. It is intended for researchers and developers who need more robust NLG evaluation than traditional surface-overlap metrics such as BLEU provide.
How It Works
BLEURT is a regression model trained on human ratings of sentence pairs. It leverages transfer learning from pretrained language models (BERT, RemBERT) to capture semantic similarity and fluency. This approach allows it to learn nuanced quality judgments, outperforming simpler surface-overlap metrics by modeling more complex linguistic phenomena.
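The training objective can be pictured as standard regression fine-tuning: encode a (reference, candidate) pair with a pretrained encoder and regress onto a human rating. The sketch below is not BLEURT's actual code; it uses the Hugging Face transformers package (not a dependency of this repository) purely to illustrate the idea.

# Illustrative sketch only, not BLEURT's implementation.
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = TFBertModel.from_pretrained("bert-base-uncased")
head = tf.keras.layers.Dense(1)  # scalar regression head on the [CLS] vector
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)

def train_step(references, candidates, human_ratings):
    # Each pair is encoded as one sequence: [CLS] reference [SEP] candidate [SEP]
    inputs = tokenizer(references, candidates, padding=True,
                       truncation=True, return_tensors="tf")
    targets = tf.constant(human_ratings, dtype=tf.float32)
    with tf.GradientTape() as tape:
        cls = encoder(**inputs).last_hidden_state[:, 0, :]  # [CLS] embedding
        preds = tf.squeeze(head(cls), axis=-1)
        loss = tf.reduce_mean(tf.square(preds - targets))   # MSE against human ratings
    variables = encoder.trainable_variables + head.trainable_variables
    grads = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return loss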
Quick Start & Requirements
pip install --upgrade pip && git clone https://github.com/google-research/bleurt.git && cd bleurt && pip install .
python -m bleurt.score_files -candidate_file=... -reference_file=... -bleurt_checkpoint=BLEURT-20
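The package also exposes a Python scoring interface. A minimal sketch, assuming the bleurt.score.BleurtScorer class and a BLEURT-20 checkpoint that has already been downloaded and unzipped into a local directory:

from bleurt import score  # Python scoring module (assumed API)

# Path to the unzipped BLEURT-20 checkpoint directory (downloaded separately).
checkpoint = "BLEURT-20"

references = ["The cat sat on the mat."]
candidates = ["A cat was sitting on the mat."]

scorer = score.BleurtScorer(checkpoint)
scores = scorer.score(references=references, candidates=candidates)
print(scores)  # one float per (reference, candidate) pair; higher is better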
Highlighted Details
The recommended BLEURT-20 checkpoint is multilingual and returns scores roughly between 0 and 1, with higher values indicating closer agreement with the reference. Smaller distilled variants of the checkpoint trade some accuracy for faster scoring, and the metric can be used either from the command line or through a Python API.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The default "test" checkpoint is noted as inaccurate; users should download recommended checkpoints. While BLEURT-20 supports multiple languages, its performance on languages not explicitly tested may vary. The distinction between adequacy and fluency in its scoring can be fuzzy due to training data characteristics.