long-form-factuality by google-deepmind

Benchmark for long-form factuality in LLMs

created 1 year ago
627 stars

Top 53.6% on sourcepulse

View on GitHub
Project Summary

This repository provides tools and a dataset for evaluating the factuality of long-form responses generated by large language models. It is intended for researchers and developers working on LLM factuality and aims to offer a standardized benchmark and evaluation framework.

How It Works

The project introduces LongFact, a dataset of 2,280 fact-seeking prompts that call for detailed, long-form answers. It also presents the Search-Augmented Factuality Evaluator (SAFE), an automated pipeline that splits a long-form response into individual facts and uses an LLM issuing Google Search queries to rate each fact as supported or not. To score whole responses, the project proposes F1@K, which extends the F1 metric to the long-form setting: precision is the fraction of supported facts, and recall is measured against K, a hyperparameter representing the human-preferred number of facts in a response.
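
A compact statement of the metric, following the paper's definitions (notation paraphrased): for a response y with S(y) supported facts and N(y) not-supported facts,

    Prec(y) = S(y) / (S(y) + N(y))
    R_K(y)  = min(S(y) / K, 1)
    F1@K(y) = 2 * Prec(y) * R_K(y) / (Prec(y) + R_K(y))   if S(y) > 0, else 0

Precision rewards responses whose individual facts are supported, while recall saturates once a response contains at least K supported facts, so K encodes the preferred response length.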

Quick Start & Requirements

  • Install: Clone the repository, create a Python 3.10+ conda environment, activate it, and run pip install -r requirements.txt.
  • Prerequisites: Python 3.10+, Conda. API keys for OpenAI and Anthropic models are required for benchmarking.
  • Usage (a command sketch follows this list):
    • LongFact dataset: longfact/
    • Data generation pipeline: python -m data_creation.pipeline
    • SAFE evaluation: python -m eval.safe
    • Benchmarking models: python -m main.pipeline and python -m eval.run_eval
  • Links: Paper — "Long-form factuality in large language models" (arXiv:2403.18802)
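
A minimal end-to-end sketch of the steps above, assuming a conda environment named longfact (the name is illustrative) and API keys configured as the repository's README describes:

    git clone https://github.com/google-deepmind/long-form-factuality.git
    cd long-form-factuality
    conda create -n longfact python=3.10 -y
    conda activate longfact
    pip install -r requirements.txt

    # Run the data generation pipeline
    python -m data_creation.pipeline

    # Run SAFE to fact-check long-form responses
    python -m eval.safe

    # Benchmark models, then score the resulting responses
    python -m main.pipeline
    python -m eval.run_eval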

Highlighted Details

  • Benchmarks OpenAI and Anthropic models.
  • Includes a dataset (LongFact) with 2,280 fact-seeking prompts.
  • Features an automated evaluation framework (SAFE) for long-form factuality.
  • Extends F1 score to long-form settings with F1@K.

Maintenance & Community

This project is from Google DeepMind. No specific community channels or roadmap are detailed in the README.

Licensing & Compatibility

Software is licensed under the Apache License, Version 2.0; all other materials are licensed under the Creative Commons Attribution 4.0 International license (CC-BY). Apache 2.0 is permissive, allowing commercial use and closed-source linking, while CC-BY requires attribution.

Limitations & Caveats

The primary focus is on benchmarking specific OpenAI and Anthropic models. The README does not detail support for other LLMs or custom model integration. API keys are required for benchmarking, which may incur costs.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 4
  • Issues (30d): 0
  • Star History: 24 stars in the last 90 days
