semantic_uncertainty by jlko

Code for reproducing semantic uncertainty research paper experiments

Created 1 year ago · 361 stars · Top 77.6% on SourcePulse

View on GitHub
Project Summary

This repository provides code to reproduce experiments on detecting hallucinations in Large Language Models (LLMs) using semantic entropy. It targets researchers and practitioners working with LLMs who need to evaluate and mitigate model-generated inaccuracies. The primary benefit is a reproducible framework for quantifying and identifying LLM hallucinations.

How It Works

The project implements semantic entropy, an uncertainty metric computed over meanings rather than over token sequences. It samples multiple responses, together with their likelihoods and hidden states, from various LLMs across different datasets; clusters the responses into semantic equivalence classes via bidirectional entailment; and computes the entropy of the probability mass assigned to each class. Aggregate performance metrics are then computed from these uncertainty measures. High semantic entropy indicates the model is uncertain about the meaning of its answer, which provides a quantitative signal of its tendency to "hallucinate" factually incorrect information.
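
To make this concrete, below is a minimal Python sketch of the metric, assuming an entails(a, b) predicate (the paper uses an NLI model such as DeBERTa for this) and per-answer sequence log-likelihoods from the generator. It illustrates the idea; it is not the repository's code.

    import math

    def cluster_indices(answers, entails):
        """Group answer indices into semantic equivalence classes.

        Two answers share a class when entailment holds in both
        directions against the class's first member as representative.
        """
        clusters = []
        for i, ans in enumerate(answers):
            for cluster in clusters:
                rep = answers[cluster[0]]
                if entails(ans, rep) and entails(rep, ans):
                    cluster.append(i)
                    break
            else:
                clusters.append([i])
        return clusters

    def semantic_entropy(answers, log_likelihoods, entails):
        """Entropy over semantic clusters rather than raw strings."""
        probs = [math.exp(ll) for ll in log_likelihoods]
        total = sum(probs)
        entropy = 0.0
        for cluster in cluster_indices(answers, entails):
            # Cluster probability: summed, normalised sequence probabilities.
            p = sum(probs[i] for i in cluster) / total
            entropy -= p * math.log(p)
        return entropy

    # Toy usage with a trivial string-match stand-in for entailment:
    answers = ["Paris", "paris.", "Lyon"]
    log_likelihoods = [-0.2, -0.4, -2.5]
    same = lambda a, b: a.lower().strip(".") == b.lower().strip(".")
    print(semantic_entropy(answers, log_likelihoods, same))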

Quick Start & Requirements

  • Install: conda env update -f environment.yaml followed by conda activate semantic_uncertainty.
  • Prerequisites: Python 3.11, PyTorch 2.1, Conda, a Weights & Biases (wandb) account and API key, a Hugging Face account and token (gated access approval may be needed for Llama-2 models), and an OpenAI API key for the sentence-length experiments.
  • Hardware: modern CPU (e.g., Intel 10th gen) and 16GB+ RAM. Crucially, one or more NVIDIA GPUs are required for feasible runtimes. GPU memory requirements scale with LLM size (7B models: 24GB GPU; 13B models: 80GB GPU; 70B models: 2x80GB GPUs); float16 or int8 precision can reduce memory needs (see the sketch after this list).
  • Setup Time: Approximately 15 minutes for environment setup.
  • Links: Conda, Weights & Biases, Hugging Face, OpenAI API.
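
For the precision point above, here is a hedged sketch of loading a 7B checkpoint in float16 with Hugging Face transformers; the model ID and flags are illustrative and not taken from the repository.

    # Hedged sketch: loading a 7B model in reduced precision so it fits
    # on a 24GB GPU (illustrative; not the repository's own loading code).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-7b-chat-hf"  # gated: needs HF access approval
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # float16 halves memory relative to float32; int8 quantisation
    # (load_in_8bit=True, via the bitsandbytes integration) halves it
    # again at some quality cost.
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map="auto",  # requires the `accelerate` package
    )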

Highlighted Details

  • Reproduces experiments from a Nature submission on semantic uncertainty.
  • Supports multiple LLMs (Llama-2, Falcon, Mistral) and datasets (TriviaQA, SQuAD, BioASQ, NQ).
  • Includes scripts for response generation, uncertainty computation, and result analysis (a generation sketch follows this list).
  • A demo run of Llama-2 Chat (7B) on TriviaQA is estimated at roughly 1 hour on suitable hardware (see Quick Start).
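
To make the generation stage concrete, the sketch below samples several answers together with their sequence log-likelihoods via the transformers generate API. The function name is hypothetical and the repository's actual scripts differ.

    # Hedged sketch of the generation stage: sample several answers and
    # their sequence log-likelihoods (illustrative, not the repo's scripts).
    import torch.nn.functional as F

    def sample_answers(model, tokenizer, prompt, n=5, max_new_tokens=32):
        """Return n (answer_text, log_likelihood) pairs for a prompt.

        `model` and `tokenizer` can be the ones loaded in the Quick
        Start sketch above.
        """
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        prompt_len = inputs["input_ids"].shape[1]
        results = []
        for _ in range(n):
            out = model.generate(
                **inputs,
                do_sample=True,
                temperature=1.0,
                max_new_tokens=max_new_tokens,
                output_scores=True,
                return_dict_in_generate=True,
            )
            generated = out.sequences[0, prompt_len:]
            # out.scores holds one [batch, vocab] logit tensor per step;
            # sum the log-probabilities of the tokens actually sampled.
            ll = sum(
                F.log_softmax(out.scores[step][0], dim=-1)[tok].item()
                for step, tok in enumerate(generated)
            )
            results.append(
                (tokenizer.decode(generated, skip_special_tokens=True), ll)
            )
        return results

    # These (answer, log-likelihood) pairs feed directly into the
    # semantic_entropy sketch from "How It Works".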

Maintenance & Community

The repository builds upon a previous, now deprecated, codebase. No specific community channels or active maintenance signals are mentioned in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility with commercial or closed-source projects is not discussed.

Limitations & Caveats

The project depends on specific versions of Python and PyTorch, and the README advises against relying on the exact environment export. Reproducing the sentence-length experiments requires the OpenAI API and incurs costs, and the BioASQ dataset must be downloaded manually.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 30 days

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Travis Fischer (Founder of Agentic), and 1 more.

Explore Similar Projects

HaluEval by RUCAIBox

0.8% · 510 stars
Benchmark dataset for LLM hallucination evaluation
Created 2 years ago · Updated 1 year ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab

10.6% · 2k stars
Speculative decoding research paper for faster LLM inference
Created 1 year ago · Updated 1 week ago
Starred by Anastasios Angelopoulos (Cofounder of LMArena), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

transformer-debugger by openai

0.1% · 4k stars
Tool for language model behavior investigation
Created 1 year ago · Updated 1 year ago