long-form-factuality by google-deepmind

Benchmark for long-form factuality in LLMs

Created 1 year ago
640 stars

Top 51.9% on SourcePulse

View on GitHub
Project Summary

This repository provides tools and a dataset for evaluating the factuality of long-form responses generated by large language models. It is intended for researchers and developers working on LLM factuality and aims to offer a standardized benchmark and evaluation framework.

How It Works

The project introduces LongFact, a dataset of 2,280 fact-seeking prompts that require detailed, long-form answers. It also presents the Search-Augmented Factuality Evaluator (SAFE), an automated system that splits a long-form response into individual facts and uses search results to rate each one as supported or not. To score whole responses, the project extends the F1 metric to the long-form setting with F1@K, which balances factual precision against recall measured relative to a human-preferred number of facts.
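
For intuition, here is a minimal sketch of how F1@K can be computed once SAFE has labeled each fact in a response as supported or not supported; the function name and example numbers are illustrative rather than the repository's actual API.

    def f1_at_k(num_supported: int, num_not_supported: int, k: int) -> float:
        """Sketch of F1@K: factual precision combined with recall measured
        against K, a human-preferred number of supported facts."""
        if num_supported == 0:
            return 0.0  # a response with no supported facts scores zero
        precision = num_supported / (num_supported + num_not_supported)
        recall = min(num_supported / k, 1.0)
        return 2 * precision * recall / (precision + recall)

    # Example: 40 supported facts, 10 unsupported facts, evaluated at K = 64
    print(round(f1_at_k(40, 10, 64), 2))  # 0.7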

Quick Start & Requirements

  • Install: Clone the repository, create a Python 3.10+ conda environment, activate it, and run pip install -r requirements.txt (the full command sequence is sketched after this list).
  • Prerequisites: Python 3.10+, Conda. API keys for OpenAI and Anthropic models are required for benchmarking.
  • Usage:
    • LongFact dataset: longfact/
    • Data generation pipeline: python -m data_creation.pipeline
    • SAFE evaluation: python -m eval.safe
    • Benchmarking models: python -m main.pipeline and python -m eval.run_eval
  • Links: Paper
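
A consolidated sketch of the steps above, assuming a standard conda workflow; the environment name is illustrative, and API keys must be configured as described in the repository README before benchmarking:

    # clone the repository and set up a Python 3.10 environment
    git clone https://github.com/google-deepmind/long-form-factuality.git
    cd long-form-factuality
    conda create -n longfact python=3.10 -y
    conda activate longfact
    pip install -r requirements.txt

    # benchmark a model, then evaluate its responses (per the usage list above)
    python -m main.pipeline
    python -m eval.run_eval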

Highlighted Details

  • Benchmarks OpenAI and Anthropic models.
  • Includes a dataset (LongFact) with 2,280 fact-seeking prompts.
  • Features an automated evaluation framework (SAFE) for long-form factuality.
  • Extends F1 score to long-form settings with F1@K.

Maintenance & Community

This project is from Google DeepMind. No specific community channels or roadmap are detailed in the README.

Licensing & Compatibility

Software is licensed under the Apache License, Version 2.0; other materials are licensed under Creative Commons Attribution 4.0 International (CC-BY). Apache 2.0 is permissive, allowing commercial use and closed-source linking, while CC-BY requires attribution.

Limitations & Caveats

The benchmarking pipeline focuses on specific OpenAI and Anthropic models; the README does not describe support for other LLMs or custom model integration. Benchmarking requires API keys and may incur API usage costs.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Didier Lopes (founder of OpenBB), and 2 more.

RULER by NVIDIA

  • Top 0.8% · 1k stars
  • Evaluation suite for long-context language models (research paper)
  • Created 1 year ago · Updated 1 month ago