ALCE by princeton-nlp

Benchmark for evaluating LLMs' citation abilities

created 2 years ago
490 stars

Top 63.8% on sourcepulse

Project Summary

ALCE provides a benchmark and tools for evaluating Large Language Models' ability to generate text with accurate citations. It addresses the need for reliable, verifiable LLM outputs, targeting researchers and developers in NLP and AI. The project enables automatic evaluation of fluency, correctness, and citation quality across three datasets: ASQA, QAMPARI, and ELI5.

How It Works

ALCE evaluates LLM generations along multiple dimensions: fluency, correctness, and, critically, citation quality. The framework supports several generation methods (vanilla LLM outputs, summarized context, extracted snippets, and interactive document retrieval), enabling a comprehensive comparison of citation-aware generation strategies.
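At the core of citation quality are recall (is each statement supported by the passages it cites?) and precision (does each individual citation actually contribute support?). Below is a minimal, simplified sketch of these two scores. The `entails` predicate here is a toy substring check standing in for the NLI model ALCE actually uses; the function names and the exact precision rule are illustrative, not the repository's implementation.

```python
# Hedged sketch of citation recall/precision in the spirit of ALCE.
# `entails` is a toy stand-in for a real NLI entailment model.

def entails(premise: str, claim: str) -> bool:
    # Toy entailment: the premise "supports" the claim if it contains it.
    return claim.lower() in premise.lower()

def citation_scores(statements):
    """statements: list of (claim, [cited_passage, ...]) pairs."""
    recall_hits, cited, precise = 0, 0, 0
    for claim, passages in statements:
        supported = entails(" ".join(passages), claim)
        recall_hits += supported
        for i, passage in enumerate(passages):
            cited += 1
            rest = passages[:i] + passages[i + 1:]
            # A citation counts as precise if it alone supports the claim,
            # or if dropping it breaks the joint support (simplified rule).
            if entails(passage, claim) or (supported and not entails(" ".join(rest), claim)):
                precise += 1
    recall = recall_hits / len(statements) if statements else 0.0
    precision = precise / cited if cited else 0.0
    return recall, precision
```

With this toy predicate, a claim backed by one relevant and one irrelevant citation yields full recall for that statement but only partial precision, which is exactly the distinction the benchmark is designed to surface.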

Quick Start & Requirements

  • Install: pip install torch transformers accelerate openai
  • Prerequisites: PyTorch (tested on 2.1.0.dev20230514+cu118), Transformers (4.28.1), Accelerate (0.17.1), OpenAI API (0.27.4), Python 3.9.7. For retrieval: pyserini (0.21.0), sentence-transformers (2.2.2).
  • Data Download: bash download_data.sh
  • Retrieval Setup: Requires downloading Sphere index for BM25 or DPR Wikipedia snapshot for GTR. GTR dense index building requires ~80GB GPU memory.
  • Links: Paper
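Putting the steps above together, a typical session might look like the following. The `pip install` and `download_data.sh` lines come from the requirements above; `run.py` and `eval.py` are the repository's entry points, but the config filename and output path shown here are illustrative placeholders, so check the repo README for the configs it actually ships.

```shell
# Environment (the pinned versions listed above are safest)
pip install torch transformers accelerate openai

# Fetch the ASQA / QAMPARI / ELI5 data and prompts
bash download_data.sh

# Generate with a config, then evaluate (names below are illustrative)
python run.py --config configs/asqa_turbo_shot2_ndoc5_gtr_default.yaml
python eval.py --f result/asqa_output.json --citations
```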

Highlighted Details

  • Reproduces baseline generations for various LLMs (OpenAI, Vicuna) and citation methods.
  • Supports both OpenAI API and offline HuggingFace models.
  • Includes post-hoc citation generation using GTR-large.
  • Offers comprehensive evaluation scripts for different datasets and metrics (e.g., MAUVE for fluency, NLI-based entailment for citation quality).
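The post-hoc citation idea is simple: embed each generated sentence and each candidate passage, then cite the nearest passage by similarity. The sketch below keeps that shape but substitutes a toy bag-of-words "embedding" for the dense encoder so it stays self-contained; in ALCE's setting the encoder would be GTR-large (e.g., via sentence-transformers), and all names here are stand-ins.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real setup would use a dense
    # encoder such as GTR-large instead of token counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def post_hoc_cite(sentences, passages):
    """Append the 1-based index of the most similar passage to each sentence."""
    p_vecs = [embed(p) for p in passages]
    cited = []
    for s in sentences:
        sims = [cosine(embed(s), pv) for pv in p_vecs]
        best = max(range(len(passages)), key=sims.__getitem__)
        cited.append(f"{s} [{best + 1}]")
    return cited
```

Because attribution happens after generation, this method can add citations to any model's output, at the cost that the cited passage is the most similar one rather than the one the model actually drew on.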

Maintenance & Community

  • Developed by Princeton NLP.
  • Contact: Tianyu Gao (tianyug@cs.princeton.edu) for questions. Issues can be reported via GitHub.

Licensing & Compatibility

  • The repository itself does not explicitly state a license. The included datasets and models may have their own licenses.

Limitations & Caveats

The retrieval setup, particularly for dense embeddings (GTR), is computationally expensive, requiring significant GPU memory and time. Setting up the BM25 index also involves downloading large files and configuring environment variables. The project is tested on specific older versions of dependencies, which may require careful management for compatibility with newer environments.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 11 stars in the last 90 days
