freshqa by freshllms

Dataset and code for refreshing LLMs with search

created 1 year ago
368 stars

Top 77.8% on sourcepulse

Project Summary

This repository provides the dataset and code for FreshLLMs, a method for refreshing Large Language Models (LLMs) through search engine augmentation. It is relevant to LLM researchers and developers who want to improve model factuality and keep responses up to date, offering a structured approach to data collection and evaluation.

How It Works

The project centers on the FreshQA dataset, a continuously updated collection of questions and answers designed to evaluate LLM factuality. It also introduces FreshEval, an automatic evaluation metric that uses few-shot in-context learning with an LLM judge to assess response quality, approximating human judgments of factuality.
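As a rough illustration of this few-shot judging setup, the sketch below sends a prompt containing example ratings to an LLM judge via the OpenAI API. The prompt wording, the example judgments, and the rating labels are assumptions for illustration, not the repository's actual templates.

```python
# Minimal sketch of few-shot, LLM-based factuality grading in the spirit of
# FreshEval. Prompt wording and examples are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FEW_SHOT_EXAMPLES = """\
question: How many planets are in our solar system?
correct answers: 8 | eight
response: There are eight planets.
rating: CORRECT

question: Who won the most recent FIFA World Cup?
correct answers: Argentina
response: France won the most recent World Cup.
rating: WRONG
"""

def fresh_eval(question: str, correct_answers: list[str], response: str) -> str:
    """Ask an LLM judge to rate a response as CORRECT or WRONG."""
    prompt = (
        "Rate each response as CORRECT or WRONG given the question and the "
        "up-to-date correct answers.\n\n"
        f"{FEW_SHOT_EXAMPLES}\n"
        f"question: {question}\n"
        f"correct answers: {' | '.join(correct_answers)}\n"
        f"response: {response}\n"
        "rating:"
    )
    result = client.chat.completions.create(
        model="gpt-4-1106-preview",  # the README's recommended judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return result.choices[0].message.content.strip()
```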

Quick Start & Requirements

  • FreshQA Dataset: Access via Google Sheets or download as a CSV; snapshots are updated weekly (see the loading sketch below).
  • FreshEval: Runs in Google Colab notebooks; requires a Google Drive account for data storage and API access to an LLM (e.g., GPT-4).
  • Dependencies: Python, Google Colab, LLM APIs.
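For a quick look at a downloaded snapshot, the following sketch loads it with pandas. The filename and the commented column name are assumptions for illustration and may not match the actual CSV schema.

```python
# Minimal sketch: inspect a downloaded FreshQA snapshot with pandas.
# The filename and column names below are assumptions, not the real schema.
import pandas as pd

df = pd.read_csv("FreshQA_snapshot.csv")  # hypothetical weekly snapshot name
print(df.shape)                 # number of rows (questions) and columns
print(df.columns.tolist())      # check the actual column names first

# e.g., filter to questions whose answers change over time (hypothetical column):
# fast_changing = df[df["fact_type"] == "fast-changing"]
```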

Highlighted Details

  • The FreshQA dataset has influenced or been used to evaluate major LLMs, including Google Gemini and Perplexity.AI's Online LLMs.
  • The FreshEval metric shows high agreement with human raters when evaluating LLM factuality.
  • FreshEval supports both "Relaxed" and "Strict" evaluation modes (see the sketch after this list).
  • Weekly dataset updates are provided, with mechanisms for community contribution.
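As a concrete reading of the two modes, the sketch below applies one plausible Relaxed/Strict rule to a judged response: Relaxed credits a response if its primary answer is correct, while Strict additionally rejects responses containing any inaccurate or outdated claims. The field names and exact criteria are assumptions based on the project's description, not the repository's implementation.

```python
# Illustrative sketch of "Relaxed" vs. "Strict" grading (field names assumed).
from dataclasses import dataclass

@dataclass
class Judgment:
    primary_answer_correct: bool   # main answer matches an up-to-date answer
    has_inaccurate_claims: bool    # any hallucinated or outdated statements

def is_correct(j: Judgment, mode: str = "strict") -> bool:
    if mode == "relaxed":
        # Relaxed: only the primary answer has to be right.
        return j.primary_answer_correct
    # Strict: the primary answer must be right AND all other claims accurate.
    return j.primary_answer_correct and not j.has_inaccurate_claims

# Example: correct main answer, but with an outdated supporting claim.
print(is_correct(Judgment(True, True), mode="relaxed"))  # True
print(is_correct(Judgment(True, True), mode="strict"))   # False
```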

Maintenance & Community

The project credits several contributors for both the original creation of the dataset and its ongoing updates. SerpApi sponsors the project, providing search credits for FreshPrompt users.

Licensing & Compatibility

The repository does not explicitly state a license. The provided citation is for an arXiv paper. Commercial use implications are not detailed.

Limitations & Caveats

The accuracy of the FreshEval metric depends on the chosen judge LLM and requires API access. The README recommends gpt-4-1106-preview over gpt-4-0125-preview for FreshEval, citing slightly better agreement with human annotations in their evaluation.

Health Check

  • Last commit: 6 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 14 stars in the last 90 days
