Benchmark for long-form factuality in LLMs
This repository provides tools and a dataset for evaluating the factuality of long-form responses generated by large language models. It is intended for researchers and developers working on LLM factuality and aims to offer a standardized benchmark and evaluation framework.
How It Works
The project introduces LongFact, a set of 2,280 fact-seeking prompts spanning 38 topics that require detailed, long-form answers. It also provides the Search-Augmented Factuality Evaluator (SAFE), an automated method that uses an LLM to split a long-form response into individual facts and then rates each fact using Google Search results. To aggregate these per-fact ratings, the project extends the F1 score to long-form content: precision is the fraction of supplied facts that are supported, while recall is measured relative to a human-preferred number of supported facts.
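For reference, the long-form F1 described above (called F1@K in the accompanying paper) can be sketched as follows. The counts of supported and not-supported facts come from SAFE's per-fact ratings, and K stands for the human-preferred number of supported facts; this is a sketch of the published formula, not code taken from the repository.

    def long_form_f1(num_supported: int, num_not_supported: int, k: int) -> float:
        """Sketch of F1@K: precision over the rated facts, recall measured
        against K, a human-preferred number of supported facts."""
        if num_supported == 0:
            return 0.0
        precision = num_supported / (num_supported + num_not_supported)
        recall = min(num_supported / k, 1.0)
        return 2 * precision * recall / (precision + recall)

    # Example: 40 supported facts, 10 not supported, K = 64.
    print(long_form_f1(40, 10, 64))  # precision 0.8, recall 0.625 -> ~0.70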
Quick Start & Requirements
Install dependencies: pip install -r requirements.txt
The LongFact prompt sets ship with the repository in the longfact/ directory.
Generate new LongFact-style prompts: python -m data_creation.pipeline
Run SAFE on long-form responses: python -m eval.safe (see the conceptual sketch after this list)
Run the full benchmarking pipeline: python -m main.pipeline; the evaluation stage can also be run on its own with python -m eval.run_eval
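Conceptually, running SAFE amounts to splitting a response into individual facts and rating each one with a search-augmented language model. The sketch below illustrates that loop; the helper functions are hypothetical placeholders, not the repository's actual API.

    # Conceptual sketch of the SAFE loop. `split_into_atomic_facts` and
    # `rate_fact_with_search` are hypothetical placeholders standing in for
    # the LLM- and Google-Search-backed steps described in the paper.

    def split_into_atomic_facts(response: str) -> list[str]:
        """Hypothetical: prompt an LLM to list the individual facts in `response`."""
        raise NotImplementedError

    def rate_fact_with_search(fact: str) -> str:
        """Hypothetical: query a search engine and ask an LLM to label the fact
        as 'supported', 'not_supported', or 'irrelevant'."""
        raise NotImplementedError

    def count_fact_ratings(response: str) -> dict[str, int]:
        counts = {"supported": 0, "not_supported": 0, "irrelevant": 0}
        for fact in split_into_atomic_facts(response):
            counts[rate_fact_with_search(fact)] += 1
        return counts

The resulting counts of supported and not-supported facts are what feed into the long-form F1 shown earlier.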
Highlighted Details
Maintenance & Community
This project is from Google DeepMind. No specific community channels or roadmap are detailed in the README.
Licensing & Compatibility
The software is licensed under the Apache License, Version 2.0; other materials are licensed under Creative Commons Attribution 4.0 International (CC-BY 4.0). Apache 2.0 is permissive, allowing commercial use and inclusion in closed-source software; CC-BY requires attribution.
Limitations & Caveats
The benchmarking pipeline primarily targets specific OpenAI and Anthropic models; the README does not detail support for other LLMs or custom model integration. API keys are required for the model providers and for the search backend SAFE uses, so running the benchmark may incur API costs.