semantic_uncertainty by jlko

Code for reproducing semantic uncertainty research paper experiments

created 1 year ago
349 stars

Top 80.8% on sourcepulse

View on GitHub
Project Summary

This repository provides code to reproduce experiments on detecting hallucinations in Large Language Models (LLMs) using semantic entropy. It targets researchers and practitioners working with LLMs who need to evaluate and mitigate model-generated inaccuracies. The primary benefit is a reproducible framework for quantifying and identifying LLM hallucinations.

How It Works

The project implements a semantic entropy metric to measure uncertainty in LLM outputs. It samples responses and their likelihoods/hidden states from various LLMs across different datasets. Uncertainty measures are then computed from these outputs, followed by an analysis of aggregate performance metrics. This approach allows for a quantitative assessment of an LLM's tendency to "hallucinate" or generate factually incorrect information.
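The repository's exact routines are not reproduced here, but the core idea admits a short sketch: sample several answers, group the ones that mean the same thing, and compute entropy over the groups rather than over surface strings. The greedy clustering, the equivalence predicate, and the probability aggregation below are illustrative assumptions, not this codebase's implementation.

```python
import math

def semantic_entropy(responses, log_likelihoods, are_equivalent):
    """Illustrative semantic-entropy estimate over sampled LLM answers.

    responses       : list of sampled answer strings for one question
    log_likelihoods : per-answer log-probabilities from the model
    are_equivalent  : callable(a, b) -> bool deciding semantic equivalence
                      (e.g. a bidirectional entailment check); a placeholder here
    """
    # Greedily group answers into semantic-equivalence clusters.
    clusters = []  # each cluster is a list of indices into `responses`
    for i, answer in enumerate(responses):
        for cluster in clusters:
            if are_equivalent(responses[cluster[0]], answer):
                cluster.append(i)
                break
        else:
            clusters.append([i])

    # Pool probability mass within each cluster, then take entropy over clusters.
    probs = [math.exp(ll) for ll in log_likelihoods]
    total = sum(probs)
    cluster_mass = [sum(probs[i] for i in c) / total for c in clusters]
    return -sum(p * math.log(p) for p in cluster_mass if p > 0)
```

A crude `are_equivalent` could be exact string match; a more faithful one asks an NLI model whether two answers entail each other in both directions. High semantic entropy then flags questions on which the model's answers disagree in meaning, which is the hallucination signal this repository evaluates.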

Quick Start & Requirements

  • Install: conda env update -f environment.yaml followed by conda activate semantic_uncertainty.
  • Prerequisites: Python 3.11, PyTorch 2.1, Conda, a Weights & Biases (wandb) account and API key, a Hugging Face account and token (the gated Llama-2 models require approved access), and an OpenAI API key for the sentence-length experiments.
  • Hardware: a modern CPU (e.g., Intel 10th gen) and 16GB+ RAM. Crucially, one or more NVIDIA GPUs are required for a feasible runtime. GPU memory requirements scale with model size (7B models: 24GB GPU; 13B models: 80GB GPU; 70B models: 2x 80GB GPUs); float16 or int8 precision can reduce memory needs (see the sanity-check sketch after this list).
  • Setup Time: Approximately 15 minutes for environment setup.
  • Links: Conda, Weights & Biases, Hugging Face, OpenAI API.
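The repository's own documentation covers setup via the conda environment above; the snippet below is only a hedged sanity check, assuming the Hugging Face transformers and PyTorch stack, that a GPU is visible and that a gated Llama-2 7B chat checkpoint loads in float16. The model id and the use of device_map are illustrative, not prescribed by the README.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Confirm a CUDA GPU is visible; CPU-only runs are not practical for these experiments.
assert torch.cuda.is_available(), "No CUDA GPU detected"
print(f"GPUs: {torch.cuda.device_count()}, "
      f"GPU 0 memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.0f} GB")

# Illustrative gated checkpoint; requires an approved Hugging Face token (`huggingface-cli login`).
model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # roughly halves memory vs. float32; int8 quantization cuts it further
    device_map="auto",          # spreads layers across available GPUs (needs the accelerate package)
)
```

If the float16 load still does not fit, the int8 option mentioned above is the next lever for reducing memory.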

Highlighted Details

  • Reproduces experiments from a Nature submission on semantic uncertainty.
  • Supports multiple LLMs (Llama-2, Falcon, Mistral) and datasets (TriviaQA, SQuAD, BioASQ, NQ).
  • Includes scripts for response generation, uncertainty computation, and result analysis.
  • A demo run of Llama-2 Chat (7B) on TriviaQA is estimated at about 1 hour on hardware meeting the requirements above.

Maintenance & Community

The repository builds upon a previous, now deprecated, codebase. No specific community channels or active maintenance signals are mentioned in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility with commercial or closed-source projects is not discussed.

Limitations & Caveats

The project relies heavily on specific versions of Python and PyTorch, and the README advises against using the exact environment export. Reproducing sentence-length experiments requires using the OpenAI API, incurring costs. Manual data download is required for the BioASQ dataset.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 2

Star History

39 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems) and Travis Fischer (founder of Agentic).

HaluEval by RUCAIBox

Benchmark dataset for LLM hallucination evaluation

497 stars
created 2 years ago, updated 1 year ago