geval by nlpyang

GPT-4-based evaluation code for NLG, per research paper

created 2 years ago
366 stars

Top 78.1% on sourcepulse

Project Summary

G-Eval provides a framework for evaluating Natural Language Generation (NLG) outputs using GPT-4, aiming for improved alignment with human judgment. It is designed for NLP researchers and practitioners who need a robust, scalable method for assessing the quality of generated text. The primary benefit is a more reliable and consistent evaluation signal than traditional automated metrics based on n-gram overlap.

How It Works

G-Eval leverages GPT-4's reasoning capabilities by framing evaluation as a prompted scoring task: the model is given evaluation criteria and instructions, then asked to rate specific aspects of NLG quality, such as fluency, coherence, or relevance, for each generated text. This moves beyond simple n-gram overlap to capture semantic and contextual understanding, tracking human evaluators more closely.
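
At a high level, each sample's source text and generated output are inserted into a prompt that states the evaluation criteria, and GPT-4 returns a numeric rating. The sketch below illustrates that pattern with the OpenAI Python client; it is a minimal, hypothetical illustration rather than the repository's gpt4_eval.py, and the placeholder names, sampling settings, and score parsing are assumptions.

```python
# Minimal sketch of a G-Eval-style scoring call (illustrative only; the
# repository's gpt4_eval.py may differ in prompt format and parameters).
from openai import OpenAI  # assumes the openai>=1.0 Python package

client = OpenAI(api_key="YOUR_OPENAI_API_KEY")  # placeholder key


def geval_score(prompt_template: str, document: str, summary: str) -> float:
    """Fill an evaluation prompt and ask GPT-4 for a numeric quality score."""
    # The template holds the evaluation criteria and instructions; the
    # placeholder names below are assumptions, not the repository's exact ones.
    prompt = (
        prompt_template
        .replace("{{Document}}", document)
        .replace("{{Summary}}", summary)
    )

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,  # sampling settings here are assumptions
        n=5,              # averaging several samples stabilizes the score
        max_tokens=5,
    )

    # Keep only replies that parse as a bare number and average them.
    scores = []
    for choice in response.choices:
        try:
            scores.append(float(choice.message.content.strip()))
        except (TypeError, ValueError):
            continue
    return sum(scores) / len(scores) if scores else float("nan")
```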

Quick Start & Requirements

  • Primary install / run command: python gpt4_eval.py --prompt <path_to_prompt> --save_fp <output_path> --summeval_fp <data_path> --key <your_openai_api_key> (an example invocation follows this list)
  • Prerequisites: Python, OpenAI API key.
  • Links: Paper
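
As an illustration, a fluency evaluation on SummEval might be launched as below; the specific prompt, output, and data file names are assumptions, so check the repository for the actual paths:

    python gpt4_eval.py --prompt prompts/summeval/flu_detailed.txt --save_fp results/gpt4_flu_detailed.json --summeval_fp data/summeval.json --key <your_openai_api_key>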

Highlighted Details

  • Evaluates fluency on the SummEval dataset.
  • Provides meta-evaluation scripts to analyze GPT-4's assessment results (a correlation sketch follows this list).
  • Prompts for SummEval evaluation are located in prompts/summeval.
  • Results for SummEval are stored in the results directory.
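
Meta-evaluation here typically means checking how well GPT-4's scores track human judgments, for example with rank correlations over the SummEval annotations. The sketch below shows that idea; it is hypothetical, and the result file format and field names are assumptions rather than the repository's actual scripts.

```python
# Illustrative meta-evaluation: correlate GPT-4 scores with human ratings.
# The file format and field names below are assumptions, not the repository's
# actual meta-evaluation script.
import json

from scipy.stats import kendalltau, spearmanr


def meta_evaluate(results_fp: str) -> None:
    """Report rank correlations between model scores and human annotations."""
    with open(results_fp) as f:
        records = json.load(f)  # assumed: list of {"gpt4_score": ..., "human_score": ...}

    gpt4_scores = [r["gpt4_score"] for r in records]
    human_scores = [r["human_score"] for r in records]

    rho, _ = spearmanr(gpt4_scores, human_scores)
    tau, _ = kendalltau(gpt4_scores, human_scores)
    print(f"Spearman rho: {rho:.3f}   Kendall tau: {tau:.3f}")
```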

Maintenance & Community

The project is associated with the paper "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment." Further community or maintenance details are not explicitly provided in the README.

Licensing & Compatibility

The README does not specify a license. Users should assume all rights are reserved or contact the authors for clarification. Compatibility for commercial use or closed-source linking is not stated.

Limitations & Caveats

The project requires an OpenAI API key, and every evaluation run incurs API costs. Effectiveness depends on the quality and specificity of the prompts used for GPT-4 evaluation. The current implementation targets specific datasets and evaluation dimensions (e.g., fluency on SummEval) and may require adaptation for other use cases.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 37 stars in the last 90 days
