geval by nlpyang

GPT-4-based evaluation code for NLG, per research paper

created 2 years ago
366 stars

Top 78.1% on sourcepulse

Project Summary

G-Eval provides a framework for evaluating Natural Language Generation (NLG) outputs using GPT-4, aiming for improved alignment with human judgment. It is designed for NLP researchers and practitioners who need a robust, scalable method for assessing the quality of generated text. The primary benefit is a more reliable and consistent evaluation signal than traditional automated metrics based on n-gram overlap.

How It Works

G-Eval leverages GPT-4's reasoning capabilities by framing evaluation as a prompted scoring task: the model is given evaluation criteria and instructions, then asked to rate specific aspects of NLG quality, such as fluency, coherence, or relevance, for each generated text. This moves beyond simple n-gram overlap to capture semantic and contextual understanding, tracking human evaluators more closely.
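
At a high level, each sample's source text and generated output are inserted into a prompt that states the evaluation criteria, and GPT-4 returns a numeric rating. The sketch below illustrates that pattern with the OpenAI Python client; it is a minimal, hypothetical illustration rather than the repository's gpt4_eval.py, and the placeholder names, sampling settings, and score parsing are assumptions.

```python
# Minimal sketch of a G-Eval-style scoring call (illustrative only; the
# repository's gpt4_eval.py may differ in prompt format and parameters).
from openai import OpenAI  # assumes the openai>=1.0 Python package

client = OpenAI(api_key="YOUR_OPENAI_API_KEY")  # placeholder key


def geval_score(prompt_template: str, document: str, summary: str) -> float:
    """Fill an evaluation prompt and ask GPT-4 for a numeric quality score."""
    # The template holds the evaluation criteria and instructions; the
    # placeholder names below are assumptions, not the repository's exact ones.
    prompt = (
        prompt_template
        .replace("{{Document}}", document)
        .replace("{{Summary}}", summary)
    )

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,  # sampling settings here are assumptions
        n=5,              # averaging several samples stabilizes the score
        max_tokens=5,
    )

    # Keep only replies that parse as a bare number and average them.
    scores = []
    for choice in response.choices:
        try:
            scores.append(float(choice.message.content.strip()))
        except (TypeError, ValueError):
            continue
    return sum(scores) / len(scores) if scores else float("nan")
```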

Quick Start & Requirements

  • Primary install / run command: python gpt4_eval.py --prompt <path_to_prompt> --save_fp <output_path> --summeval_fp <data_path> --key <your_openai_api_key> (an example invocation follows this list)
  • Prerequisites: Python, OpenAI API key.
  • Links: Paper
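
As an illustration, a fluency evaluation on SummEval might be launched as below; the specific prompt, output, and data file names are assumptions, so check the repository for the actual paths:

    python gpt4_eval.py --prompt prompts/summeval/flu_detailed.txt --save_fp results/gpt4_flu_detailed.json --summeval_fp data/summeval.json --key <your_openai_api_key>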

Highlighted Details

  • Evaluates fluency on the SummEval dataset.
  • Provides meta-evaluation scripts to analyze GPT-4's assessment results (a correlation sketch follows this list).
  • Prompts for SummEval evaluation are located in prompts/summeval.
  • Results for SummEval are stored in the results directory.
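
Meta-evaluation here typically means checking how well GPT-4's scores track human judgments, for example with rank correlations over the SummEval annotations. The sketch below shows that idea; it is hypothetical, and the result file format and field names are assumptions rather than the repository's actual scripts.

```python
# Illustrative meta-evaluation: correlate GPT-4 scores with human ratings.
# The file format and field names below are assumptions, not the repository's
# actual meta-evaluation script.
import json

from scipy.stats import kendalltau, spearmanr


def meta_evaluate(results_fp: str) -> None:
    """Report rank correlations between model scores and human annotations."""
    with open(results_fp) as f:
        records = json.load(f)  # assumed: list of {"gpt4_score": ..., "human_score": ...}

    gpt4_scores = [r["gpt4_score"] for r in records]
    human_scores = [r["human_score"] for r in records]

    rho, _ = spearmanr(gpt4_scores, human_scores)
    tau, _ = kendalltau(gpt4_scores, human_scores)
    print(f"Spearman rho: {rho:.3f}   Kendall tau: {tau:.3f}")
```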

Maintenance & Community

The project is associated with the paper "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment." Further community or maintenance details are not explicitly provided in the README.

Licensing & Compatibility

The README does not specify a license. Users should assume all rights are reserved or contact the authors for clarification. Compatibility for commercial use or closed-source linking is not stated.

Limitations & Caveats

The project requires an OpenAI API key, and every evaluation run incurs API costs. Effectiveness depends on the quality and specificity of the prompts used for GPT-4 evaluation. The current implementation targets specific datasets and evaluation dimensions (e.g., fluency on SummEval) and may require adaptation for other use cases.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 37 stars in the last 90 days
