long-form-factuality by google-deepmind

Benchmark for long-form factuality in LLMs

Created 1 year ago
640 stars

Top 51.9% on SourcePulse

View on GitHub
Project Summary

This repository provides tools and a dataset for evaluating the factuality of long-form responses generated by large language models. It is intended for researchers and developers working on LLM factuality and aims to offer a standardized benchmark and evaluation framework.

How It Works

The project introduces LongFact, a dataset of 2,280 fact-seeking prompts that require detailed, long-form answers. It also presents the Search-Augmented Factuality Evaluator (SAFE), an automated system that splits a long-form response into individual facts and uses search results to rate each one as supported or not. To score whole responses, the project extends the F1 metric to the long-form setting with F1@K, which balances factual precision against recall measured relative to a human-preferred number of facts.
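
For intuition, here is a minimal sketch of how F1@K can be computed once SAFE has labeled each fact in a response as supported or not supported; the function name and example numbers are illustrative rather than the repository's actual API.

    def f1_at_k(num_supported: int, num_not_supported: int, k: int) -> float:
        """Sketch of F1@K: factual precision combined with recall measured
        against K, a human-preferred number of supported facts."""
        if num_supported == 0:
            return 0.0  # a response with no supported facts scores zero
        precision = num_supported / (num_supported + num_not_supported)
        recall = min(num_supported / k, 1.0)
        return 2 * precision * recall / (precision + recall)

    # Example: 40 supported facts, 10 unsupported facts, evaluated at K = 64
    print(round(f1_at_k(40, 10, 64), 2))  # 0.7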

Quick Start & Requirements

  • Install: Clone the repository, create a Python 3.10+ conda environment, activate it, and run pip install -r requirements.txt (the full command sequence is sketched after this list).
  • Prerequisites: Python 3.10+, Conda. API keys for OpenAI and Anthropic models are required for benchmarking.
  • Usage:
    • LongFact dataset: longfact/
    • Data generation pipeline: python -m data_creation.pipeline
    • SAFE evaluation: python -m eval.safe
    • Benchmarking models: python -m main.pipeline and python -m eval.run_eval
  • Links: Paper
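
A consolidated sketch of the steps above, assuming a standard conda workflow; the environment name is illustrative, and API keys must be configured as described in the repository README before benchmarking:

    # clone the repository and set up a Python 3.10 environment
    git clone https://github.com/google-deepmind/long-form-factuality.git
    cd long-form-factuality
    conda create -n longfact python=3.10 -y
    conda activate longfact
    pip install -r requirements.txt

    # benchmark a model, then evaluate its responses (per the usage list above)
    python -m main.pipeline
    python -m eval.run_eval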

Highlighted Details

  • Benchmarks OpenAI and Anthropic models.
  • Includes a dataset (LongFact) with 2,280 fact-seeking prompts.
  • Features an automated evaluation framework (SAFE) for long-form factuality.
  • Extends F1 score to long-form settings with F1@K.

Maintenance & Community

This project is from Google DeepMind. No specific community channels or roadmap are detailed in the README.

Licensing & Compatibility

Software is licensed under the Apache License, Version 2.0; other materials are licensed under Creative Commons Attribution 4.0 International (CC-BY). Apache 2.0 is permissive, allowing commercial use and closed-source linking, while CC-BY requires attribution.

Limitations & Caveats

The benchmarking pipeline focuses on specific OpenAI and Anthropic models; the README does not describe support for other LLMs or custom model integration. Benchmarking requires API keys and may incur API usage costs.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Didier Lopes (founder of OpenBB), and 2 more.

RULER by NVIDIA

  • Top 0.8% · 1k stars
  • Evaluation suite for long-context language models (research paper)
  • Created 1 year ago · Updated 1 month ago