GPTScore by jinlanfu

Evaluation framework for generated text (research paper)

created 2 years ago
253 stars

Top 99.5% on sourcepulse

Project Summary

GPTScore is a framework for evaluating generated text using pre-trained language models (PLMs) as evaluators. It allows for customizable, multifaceted, and training-free assessments of text quality, making it suitable for researchers and developers working with generative AI models.

How It Works

GPTScore leverages the emergent instruction-following capabilities of large PLMs to score generated text based on specified criteria. It supports evaluations with or without custom instructions and demonstrations, offering flexibility in defining evaluation aspects like "quality." The framework supports a wide range of PLMs, from smaller models like FLAN-T5-Small to large ones like GPT-3 (175B parameters).
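At its core, the approach scores a hypothesis by its average token log-likelihood under the PLM, conditioned on a prompt built from the instruction, optional demonstrations, and the context. The sketch below is illustrative only: `token_log_prob` is a hypothetical stand-in for a real PLM's conditional log-probabilities (a real implementation would query GPT-3 logprobs or FLAN-T5 logits), and whitespace splitting stands in for the model's tokenizer.

```python
def token_log_prob(token: str, prefix: str) -> float:
    """Hypothetical stand-in for a PLM's log p(token | prefix).

    Toy behavior only: tokens become slightly more predictable
    as the conditioning prefix grows.
    """
    return -1.0 / (1 + len(prefix.split()))


def gpt_score(hypothesis: str, instruction: str, context: str,
              demonstrations: tuple[str, ...] = ()) -> float:
    """Average log-probability of the hypothesis tokens, conditioned on
    a prompt assembled from instruction, demonstrations, and context."""
    prompt = "\n".join([instruction, *demonstrations, context])
    tokens = hypothesis.split()  # real code would use the PLM's tokenizer
    total = 0.0
    prefix = prompt
    for tok in tokens:
        total += token_log_prob(tok, prefix)
        prefix += " " + tok
    return total / len(tokens)


score = gpt_score(
    hypothesis="The cat sat on the mat.",
    instruction="Evaluate the fluency of the following text.",
    context="Source: a cat resting on a mat.",
)
print(score)  # a negative average log-probability
```

Because the score is just a conditional likelihood, changing the instruction (e.g. asking for fluent vs. factually consistent text) re-targets the same evaluator at a different aspect without any retraining.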

Quick Start & Requirements

  • Install/Run: Run the evaluation scripts directly, e.g. `python score_d2t.py`.
  • Prerequisites: Requires Python and specific PLMs (e.g., GPT-3, OPT, FLAN-T5, GPT-2, GPT-J). Access to large models like GPT-3 may require API keys or significant local resources.
  • Setup: The README does not detail setup time or resource footprint, but running the larger models implies substantial computational requirements.

Highlighted Details

  • Supports 19 PLMs ranging from 80M to 175B parameters.
  • Enables customizable evaluation through instructions and demonstrations.
  • Offers multifaceted evaluation capabilities with a single evaluator.
  • Operates without requiring additional training for the evaluator models.
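Multifaceted, customizable evaluation amounts to swapping the instruction (and optional demonstrations) while reusing the same evaluator. A minimal sketch under that assumption, where the aspect-to-instruction templates and the `build_prompt` helper are hypothetical (the paper defines its own prompt wording per task and aspect):

```python
# Hypothetical aspect-to-instruction templates; illustrative wording only.
ASPECT_TEMPLATES = {
    "fluency": "Generate a fluent and readable version of the source:",
    "relevance": "Generate a summary relevant to the source:",
    "factuality": "Generate a factually consistent summary of the source:",
}


def build_prompt(aspect: str, source: str, hypothesis: str,
                 demonstrations: tuple[str, ...] = ()) -> str:
    """Assemble the evaluation prompt; the trailing hypothesis is the
    span whose conditional log-probability the PLM would score."""
    parts = [ASPECT_TEMPLATES[aspect], *demonstrations, source, hypothesis]
    return "\n".join(parts)


prompt = build_prompt("fluency", "Source text here.", "Candidate output here.")
print(prompt)
```

Scoring the same hypothesis under the "fluency" and "factuality" templates yields two aspect scores from one evaluator, which is the training-free, multifaceted property highlighted above.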

Maintenance & Community

  • The project is associated with the paper "GPTScore: Evaluate as You Desire" by Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu.
  • No community links (Discord, Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • The README does not explicitly state a license. The project is presented as source code for a research paper.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The framework's effectiveness is dependent on the capabilities of the chosen PLM evaluator. The README does not detail specific performance benchmarks or potential biases introduced by the evaluator models. Accessing and running the largest supported models (e.g., GPT-3 175B) will require significant computational resources or API access.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 90 days

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Jeff Hammerbacher (cofounder of Cloudera), and 1 more.
