freshqa by freshllms

Dataset and code for refreshing LLMs with search

created 1 year ago
368 stars

Top 77.8% on sourcepulse

Project Summary

This repository provides the dataset and code for FreshLLMs, a method for refreshing Large Language Models (LLMs) through search engine augmentation. It is relevant to LLM researchers and developers who want to improve model factuality and keep responses up to date, offering a structured approach to data collection and evaluation.

How It Works

The project centers on the FreshQA dataset, a continuously updated collection of questions and answers designed to evaluate LLM factuality. It also introduces FreshEval, an automatic evaluation metric that uses few-shot in-context learning with an LLM judge to assess response quality, approximating human judgments of factuality.
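As a rough illustration of this few-shot judging setup, the sketch below sends a prompt containing example ratings to an LLM judge via the OpenAI API. The prompt wording, the example judgments, and the rating labels are assumptions for illustration, not the repository's actual templates.

```python
# Minimal sketch of few-shot, LLM-based factuality grading in the spirit of
# FreshEval. Prompt wording and examples are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FEW_SHOT_EXAMPLES = """\
question: How many planets are in our solar system?
correct answers: 8 | eight
response: There are eight planets.
rating: CORRECT

question: Who won the most recent FIFA World Cup?
correct answers: Argentina
response: France won the most recent World Cup.
rating: WRONG
"""

def fresh_eval(question: str, correct_answers: list[str], response: str) -> str:
    """Ask an LLM judge to rate a response as CORRECT or WRONG."""
    prompt = (
        "Rate each response as CORRECT or WRONG given the question and the "
        "up-to-date correct answers.\n\n"
        f"{FEW_SHOT_EXAMPLES}\n"
        f"question: {question}\n"
        f"correct answers: {' | '.join(correct_answers)}\n"
        f"response: {response}\n"
        "rating:"
    )
    result = client.chat.completions.create(
        model="gpt-4-1106-preview",  # the README's recommended judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return result.choices[0].message.content.strip()
```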

Quick Start & Requirements

  • FreshQA Dataset: Access via Google Sheets or download as a CSV; snapshots are updated weekly (see the loading sketch below).
  • FreshEval: Runs in Google Colab notebooks; requires a Google Drive account for data storage and API access to an LLM (e.g., GPT-4).
  • Dependencies: Python, Google Colab, LLM APIs.
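For a quick look at a downloaded snapshot, the following sketch loads it with pandas. The filename and the commented column name are assumptions for illustration and may not match the actual CSV schema.

```python
# Minimal sketch: inspect a downloaded FreshQA snapshot with pandas.
# The filename and column names below are assumptions, not the real schema.
import pandas as pd

df = pd.read_csv("FreshQA_snapshot.csv")  # hypothetical weekly snapshot name
print(df.shape)                 # number of rows (questions) and columns
print(df.columns.tolist())      # check the actual column names first

# e.g., filter to questions whose answers change over time (hypothetical column):
# fast_changing = df[df["fact_type"] == "fast-changing"]
```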

Highlighted Details

  • The FreshQA dataset has influenced or been used to evaluate major LLMs, including Google Gemini and Perplexity.AI's Online LLMs.
  • The FreshEval metric shows high agreement with human raters when evaluating LLM factuality.
  • FreshEval supports both "Relaxed" and "Strict" evaluation modes (see the sketch after this list).
  • Weekly dataset updates are provided, with mechanisms for community contribution.
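As a concrete reading of the two modes, the sketch below applies one plausible Relaxed/Strict rule to a judged response: Relaxed credits a response if its primary answer is correct, while Strict additionally rejects responses containing any inaccurate or outdated claims. The field names and exact criteria are assumptions based on the project's description, not the repository's implementation.

```python
# Illustrative sketch of "Relaxed" vs. "Strict" grading (field names assumed).
from dataclasses import dataclass

@dataclass
class Judgment:
    primary_answer_correct: bool   # main answer matches an up-to-date answer
    has_inaccurate_claims: bool    # any hallucinated or outdated statements

def is_correct(j: Judgment, mode: str = "strict") -> bool:
    if mode == "relaxed":
        # Relaxed: only the primary answer has to be right.
        return j.primary_answer_correct
    # Strict: the primary answer must be right AND all other claims accurate.
    return j.primary_answer_correct and not j.has_inaccurate_claims

# Example: correct main answer, but with an outdated supporting claim.
print(is_correct(Judgment(True, True), mode="relaxed"))  # True
print(is_correct(Judgment(True, True), mode="strict"))   # False
```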

Maintenance & Community

The project credits several contributors for both the original creation of the dataset and its ongoing updates. SerpApi sponsors the project, providing search credits for FreshPrompt users.

Licensing & Compatibility

The repository does not explicitly state a license. The provided citation is for an arXiv paper. Commercial use implications are not detailed.

Limitations & Caveats

The accuracy of the FreshEval metric depends on the chosen judge LLM and requires API access. The README recommends gpt-4-1106-preview over gpt-4-0125-preview for FreshEval, citing slightly better agreement with human annotations in their evaluation.

Health Check

  • Last commit: 6 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 14 stars in the last 90 days
