ALCE by princeton-nlp

Benchmark for evaluating LLMs' citation abilities

created 2 years ago
490 stars

Top 63.8% on sourcepulse

Project Summary

ALCE provides a benchmark and tools for evaluating Large Language Models' ability to generate text with accurate citations. It addresses the need for reliable, verifiable LLM outputs, targeting researchers and developers in NLP and AI. The project enables automatic evaluation of fluency, correctness, and citation quality across three datasets: ASQA, QAMPARI, and ELI5.

How It Works

ALCE evaluates LLM generations along multiple dimensions: fluency, correctness, and, critically, citation quality. The framework supports several generation methods (vanilla LLM outputs, summarized context, extracted snippets, and interactive document retrieval), enabling a comprehensive comparison of citation-aware generation strategies.
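At the core of citation quality are recall (is each statement supported by the passages it cites?) and precision (does each individual citation actually contribute support?). Below is a minimal, simplified sketch of these two scores. The `entails` predicate here is a toy substring check standing in for the NLI model ALCE actually uses; the function names and the exact precision rule are illustrative, not the repository's implementation.

```python
# Hedged sketch of citation recall/precision in the spirit of ALCE.
# `entails` is a toy stand-in for a real NLI entailment model.

def entails(premise: str, claim: str) -> bool:
    # Toy entailment: the premise "supports" the claim if it contains it.
    return claim.lower() in premise.lower()

def citation_scores(statements):
    """statements: list of (claim, [cited_passage, ...]) pairs."""
    recall_hits, cited, precise = 0, 0, 0
    for claim, passages in statements:
        supported = entails(" ".join(passages), claim)
        recall_hits += supported
        for i, passage in enumerate(passages):
            cited += 1
            rest = passages[:i] + passages[i + 1:]
            # A citation counts as precise if it alone supports the claim,
            # or if dropping it breaks the joint support (simplified rule).
            if entails(passage, claim) or (supported and not entails(" ".join(rest), claim)):
                precise += 1
    recall = recall_hits / len(statements) if statements else 0.0
    precision = precise / cited if cited else 0.0
    return recall, precision
```

With this toy predicate, a claim backed by one relevant and one irrelevant citation yields full recall for that statement but only partial precision, which is exactly the distinction the benchmark is designed to surface.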

Quick Start & Requirements

  • Install: pip install torch transformers accelerate openai
  • Prerequisites: PyTorch (tested on 2.1.0.dev20230514+cu118), Transformers (4.28.1), Accelerate (0.17.1), OpenAI API (0.27.4), Python 3.9.7. For retrieval: pyserini (0.21.0), sentence-transformers (2.2.2).
  • Data Download: bash download_data.sh
  • Retrieval Setup: Requires downloading Sphere index for BM25 or DPR Wikipedia snapshot for GTR. GTR dense index building requires ~80GB GPU memory.
  • Links: Paper
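Putting the steps above together, a typical session might look like the following. The `pip install` and `download_data.sh` lines come from the requirements above; `run.py` and `eval.py` are the repository's entry points, but the config filename and output path shown here are illustrative placeholders, so check the repo README for the configs it actually ships.

```shell
# Environment (the pinned versions listed above are safest)
pip install torch transformers accelerate openai

# Fetch the ASQA / QAMPARI / ELI5 data and prompts
bash download_data.sh

# Generate with a config, then evaluate (names below are illustrative)
python run.py --config configs/asqa_turbo_shot2_ndoc5_gtr_default.yaml
python eval.py --f result/asqa_output.json --citations
```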

Highlighted Details

  • Reproduces baseline generations for various LLMs (OpenAI, Vicuna) and citation methods.
  • Supports both OpenAI API and offline HuggingFace models.
  • Includes post-hoc citation generation using GTR-large.
  • Offers comprehensive evaluation scripts for different datasets and metrics (e.g., MAUVE for fluency, NLI-based entailment for citation quality).
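The post-hoc citation idea is simple: embed each generated sentence and each candidate passage, then cite the nearest passage by similarity. The sketch below keeps that shape but substitutes a toy bag-of-words "embedding" for the dense encoder so it stays self-contained; in ALCE's setting the encoder would be GTR-large (e.g., via sentence-transformers), and all names here are stand-ins.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real setup would use a dense
    # encoder such as GTR-large instead of token counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def post_hoc_cite(sentences, passages):
    """Append the 1-based index of the most similar passage to each sentence."""
    p_vecs = [embed(p) for p in passages]
    cited = []
    for s in sentences:
        sims = [cosine(embed(s), pv) for pv in p_vecs]
        best = max(range(len(passages)), key=sims.__getitem__)
        cited.append(f"{s} [{best + 1}]")
    return cited
```

Because attribution happens after generation, this method can add citations to any model's output, at the cost that the cited passage is the most similar one rather than the one the model actually drew on.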

Maintenance & Community

  • Developed by Princeton NLP.
  • Contact: Tianyu Gao (tianyug@cs.princeton.edu) for questions. Issues can be reported via GitHub.

Licensing & Compatibility

  • The repository itself does not explicitly state a license. The included datasets and models may have their own licenses.

Limitations & Caveats

The retrieval setup, particularly for dense embeddings (GTR), is computationally expensive, requiring significant GPU memory and time. Setting up the BM25 index also involves downloading large files and configuring environment variables. The project is tested on specific older versions of dependencies, which may require careful management for compatibility with newer environments.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 11 stars in the last 90 days
