PyTorch library for evaluating neural text generation quality
MAUVE is a Python package designed to quantify the distributional divergence between generated text and human text, addressing the need for robust evaluation metrics in natural language generation. It is particularly useful for researchers and practitioners in NLP who need to assess the quality and similarity of text produced by language models compared to human-written text. The primary benefit is its ability to capture nuanced differences that simpler metrics might miss, offering a more comprehensive understanding of model performance.
How It Works
MAUVE computes similarity by leveraging Kullback–Leibler (KL) divergences within a quantized embedding space derived from a large language model (LLM), typically GPT-2. It quantizes text representations using k-means clustering, with adaptive hyperparameter selection for this process. The approach allows for flexibility by accepting raw text, pre-computed features (e.g., LLM hidden states), or tokenized inputs, making it adaptable to various workflows and even other modalities like images or audio.
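The quantize-then-compare pipeline above can be sketched end to end in plain NumPy. This is an illustrative toy, not the mauve package's actual code: the tiny k-means routine, the bucket count, the scaling constant c=5, and the one-sided area computation are all simplifications chosen for clarity.

```python
import numpy as np

def kmeans_labels(X, k, iters=10, seed=0):
    """Toy k-means quantizer (stand-in for MAUVE's adaptive clustering)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels

def kl(p, q):
    """KL(p || q) over histogram buckets, restricted to buckets where p > 0."""
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def mauve_sketch(p_feats, q_feats, k=8, c=5.0, n_mix=25):
    """Jointly quantize two feature sets, then score a KL divergence curve."""
    X = np.vstack([p_feats, q_feats])
    labels = kmeans_labels(X, k)
    lp, lq = labels[: len(p_feats)], labels[len(p_feats):]
    p = np.bincount(lp, minlength=k) / len(lp)  # histogram of model text
    q = np.bincount(lq, minlength=k) / len(lq)  # histogram of human text
    xs, ys = [0.0, 1.0], [1.0, 0.0]             # corner points closing the curve
    for lam in np.linspace(0, 1, n_mix + 2)[1:-1]:  # interior mixtures only
        r = lam * p + (1 - lam) * q             # mixture has full support, so KL is finite
        xs.append(np.exp(-c * kl(q, r)))
        ys.append(np.exp(-c * kl(p, r)))
    env = {}                                    # outer envelope: max y per distinct x
    for x, y in zip(xs, ys):
        env[x] = max(env.get(x, 0.0), y)
    ex = np.array(sorted(env))
    ey = np.array([env[x] for x in ex])
    # trapezoidal area under the divergence curve: 1.0 when the histograms match
    return float(0.5 * np.sum((ex[1:] - ex[:-1]) * (ey[1:] + ey[:-1])))
```

In the package itself this is wrapped behind `mauve.compute_mauve`, which accepts raw text (`p_text`/`q_text`) or precomputed features, as described above.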
Quick Start & Requirements
pip install mauve-text
torch>=1.1.0 and transformers>=3.2.0 are required.
Maintenance & Community
Contributions are encouraged via GitHub issues and pull requests; GitHub issues are also the primary contact channel.
Licensing & Compatibility
The README does not explicitly state a license; verify the repository's licensing before commercial use or closed-source linking.
Limitations & Caveats
MAUVE is best suited for relative comparisons; absolute scores can vary with hyperparameters. The metric requires a substantial number of samples (thousands recommended) for reliable results, as fewer samples can lead to optimistic and high-variance scores. The runtime can be significant, especially with a large number of clusters, and can be mitigated by adjusting clustering hyperparameters at the cost of accuracy.
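The sample-size caveat is easy to see with a toy histogram experiment: estimate a smoothed KL divergence between two samples drawn from the same distribution, whose true divergence is zero. This sketch is unrelated to the mauve package's code, and the bucket count and sample sizes are arbitrary assumptions; it simply shows the estimate is both more biased away from zero and higher-variance when samples are small.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 50                              # number of histogram buckets
probs = np.ones(K) / K              # both samples come from this SAME distribution

def smoothed_kl(n):
    """KL(a || (a+b)/2) between histograms of two same-distribution samples."""
    a = np.bincount(rng.choice(K, size=n, p=probs), minlength=K) / n
    b = np.bincount(rng.choice(K, size=n, p=probs), minlength=K) / n
    r = 0.5 * (a + b)               # mixture smoothing avoids log(0)
    m = a > 0
    return float(np.sum(a[m] * np.log(a[m] / r[m])))

small = np.array([smoothed_kl(50) for _ in range(20)])    # few samples per estimate
large = np.array([smoothed_kl(5000) for _ in range(20)])  # many samples per estimate
# true divergence is 0; small samples inflate both the mean and the spread
```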