NLG metric based on transfer learning
Top 47.5% on sourcepulse
BLEURT is a Python library and command-line tool for evaluating Natural Language Generation (NLG) outputs. It provides a learned metric, based on BERT and RemBERT, that scores candidate sentences against references and aims to capture both fluency and meaning preservation. It is intended for researchers and developers who need more robust NLG evaluation than traditional surface-overlap metrics such as BLEU provide.
How It Works
BLEURT is a regression model trained on human ratings of sentence pairs. It leverages transfer learning from pretrained language models (BERT, RemBERT) to capture semantic similarity and fluency. This approach allows it to learn nuanced quality judgments, outperforming simpler surface-overlap metrics by modeling more complex linguistic phenomena.
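The training objective can be pictured as standard regression fine-tuning: encode a (reference, candidate) pair with a pretrained encoder and regress onto a human rating. The sketch below is not BLEURT's actual code; it uses the Hugging Face transformers package (not a dependency of this repository) purely to illustrate the idea.

# Illustrative sketch only, not BLEURT's implementation.
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = TFBertModel.from_pretrained("bert-base-uncased")
head = tf.keras.layers.Dense(1)  # scalar regression head on the [CLS] vector
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)

def train_step(references, candidates, human_ratings):
    # Each pair is encoded as one sequence: [CLS] reference [SEP] candidate [SEP]
    inputs = tokenizer(references, candidates, padding=True,
                       truncation=True, return_tensors="tf")
    targets = tf.constant(human_ratings, dtype=tf.float32)
    with tf.GradientTape() as tape:
        cls = encoder(**inputs).last_hidden_state[:, 0, :]  # [CLS] embedding
        preds = tf.squeeze(head(cls), axis=-1)
        loss = tf.reduce_mean(tf.square(preds - targets))   # MSE against human ratings
    variables = encoder.trainable_variables + head.trainable_variables
    grads = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return loss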
Quick Start & Requirements
pip install --upgrade pip && git clone https://github.com/google-research/bleurt.git && cd bleurt && pip install .
python -m bleurt.score_files -candidate_file=... -reference_file=... -bleurt_checkpoint=BLEURT-20
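The package also exposes a Python scoring interface. A minimal sketch, assuming the bleurt.score.BleurtScorer class and a BLEURT-20 checkpoint that has already been downloaded and unzipped into a local directory:

from bleurt import score  # Python scoring module (assumed API)

# Path to the unzipped BLEURT-20 checkpoint directory (downloaded separately).
checkpoint = "BLEURT-20"

references = ["The cat sat on the mat."]
candidates = ["A cat was sitting on the mat."]

scorer = score.BleurtScorer(checkpoint)
scores = scorer.score(references=references, candidates=candidates)
print(scores)  # one float per (reference, candidate) pair; higher is better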
Highlighted Details
The recommended BLEURT-20 checkpoint is multilingual and returns scores roughly between 0 and 1, with higher values indicating closer agreement with the reference. Smaller distilled variants of the checkpoint trade some accuracy for faster scoring, and the metric can be used either from the command line or through a Python API.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The default "test" checkpoint is noted as inaccurate; users should download recommended checkpoints. While BLEURT-20 supports multiple languages, its performance on languages not explicitly tested may vary. The distinction between adequacy and fluency in its scoring can be fuzzy due to training data characteristics.