GPTScore by jinlanfu

Evaluation framework for generated text (research paper)

created 2 years ago
253 stars

Top 99.5% on sourcepulse

Project Summary

GPTScore is a framework for evaluating generated text using pre-trained language models (PLMs) as evaluators. It allows for customizable, multifaceted, and training-free assessments of text quality, making it suitable for researchers and developers working with generative AI models.

How It Works

GPTScore leverages the emergent instruction-following capabilities of large PLMs to score generated text based on specified criteria. It supports evaluations with or without custom instructions and demonstrations, offering flexibility in defining evaluation aspects like "quality." The framework supports a wide range of PLMs, from smaller models like FLAN-T5-Small to large ones like GPT-3 (175B parameters).
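At its core, the approach scores a hypothesis by its average token log-likelihood under the PLM, conditioned on a prompt built from the instruction, optional demonstrations, and the context. The sketch below is illustrative only: `token_log_prob` is a hypothetical stand-in for a real PLM's conditional log-probabilities (a real implementation would query GPT-3 logprobs or FLAN-T5 logits), and whitespace splitting stands in for the model's tokenizer.

```python
def token_log_prob(token: str, prefix: str) -> float:
    """Hypothetical stand-in for a PLM's log p(token | prefix).

    Toy behavior only: tokens become slightly more predictable
    as the conditioning prefix grows.
    """
    return -1.0 / (1 + len(prefix.split()))


def gpt_score(hypothesis: str, instruction: str, context: str,
              demonstrations: tuple[str, ...] = ()) -> float:
    """Average log-probability of the hypothesis tokens, conditioned on
    a prompt assembled from instruction, demonstrations, and context."""
    prompt = "\n".join([instruction, *demonstrations, context])
    tokens = hypothesis.split()  # real code would use the PLM's tokenizer
    total = 0.0
    prefix = prompt
    for tok in tokens:
        total += token_log_prob(tok, prefix)
        prefix += " " + tok
    return total / len(tokens)


score = gpt_score(
    hypothesis="The cat sat on the mat.",
    instruction="Evaluate the fluency of the following text.",
    context="Source: a cat resting on a mat.",
)
print(score)  # a negative average log-probability
```

Because the score is just a conditional likelihood, changing the instruction (e.g. asking for fluent vs. factually consistent text) re-targets the same evaluator at a different aspect without any retraining.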

Quick Start & Requirements

  • Install/Run: Run the evaluation scripts directly, e.g. `python score_d2t.py`.
  • Prerequisites: Requires Python and specific PLMs (e.g., GPT-3, OPT, FLAN-T5, GPT-2, GPT-J). Access to large models like GPT-3 may require API keys or significant local resources.
  • Setup: The README does not detail setup time or resource footprint, but running the larger models implies substantial computational requirements.

Highlighted Details

  • Supports 19 PLMs ranging from 80M to 175B parameters.
  • Enables customizable evaluation through instructions and demonstrations.
  • Offers multifaceted evaluation capabilities with a single evaluator.
  • Operates without requiring additional training for the evaluator models.
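Multifaceted, customizable evaluation amounts to swapping the instruction (and optional demonstrations) while reusing the same evaluator. A minimal sketch under that assumption, where the aspect-to-instruction templates and the `build_prompt` helper are hypothetical (the paper defines its own prompt wording per task and aspect):

```python
# Hypothetical aspect-to-instruction templates; illustrative wording only.
ASPECT_TEMPLATES = {
    "fluency": "Generate a fluent and readable version of the source:",
    "relevance": "Generate a summary relevant to the source:",
    "factuality": "Generate a factually consistent summary of the source:",
}


def build_prompt(aspect: str, source: str, hypothesis: str,
                 demonstrations: tuple[str, ...] = ()) -> str:
    """Assemble the evaluation prompt; the trailing hypothesis is the
    span whose conditional log-probability the PLM would score."""
    parts = [ASPECT_TEMPLATES[aspect], *demonstrations, source, hypothesis]
    return "\n".join(parts)


prompt = build_prompt("fluency", "Source text here.", "Candidate output here.")
print(prompt)
```

Scoring the same hypothesis under the "fluency" and "factuality" templates yields two aspect scores from one evaluator, which is the training-free, multifaceted property highlighted above.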

Maintenance & Community

  • The project is associated with the paper "GPTScore: Evaluate as You Desire" by Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu.
  • No community links (Discord, Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • The README does not explicitly state a license. The project is presented as source code for a research paper.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The framework's effectiveness is dependent on the capabilities of the chosen PLM evaluator. The README does not detail specific performance benchmarks or potential biases introduced by the evaluator models. Accessing and running the largest supported models (e.g., GPT-3 175B) will require significant computational resources or API access.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 90 days

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Jeff Hammerbacher (cofounder of Cloudera), and 1 more.
