Benchmark for evaluating LLMs' citation abilities
ALCE provides a benchmark and tools for evaluating Large Language Models' ability to generate text with accurate citations. It addresses the need for reliable, verifiable LLM outputs, targeting researchers and developers in NLP and AI. The project enables automatic evaluation of fluency, correctness, and citation quality across three datasets: ASQA, QAMPARI, and ELI5.
How It Works
ALCE evaluates LLM generations along multiple dimensions: it assesses fluency and correctness and, critically, citation quality. The framework supports several generation methods, including vanilla LLM outputs with in-context passages, summarized context, extracted snippets, and interactive document retrieval, allowing a comprehensive comparison of citation-aware generation strategies.
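To make the citation-quality dimension concrete, here is a minimal sketch of how citation recall and precision can be scored over sentences and their cited passages. This is a simplified illustration, not ALCE's actual implementation: the benchmark uses an NLI model to judge entailment, whereas the `entails` stand-in below is a toy substring check, and the precision rule is reduced to "the passage alone supports the claim".

```python
# Simplified, hypothetical sketch of ALCE-style citation scoring.
# Assumption: `entails` stands in for a real NLI entailment judgment.

def entails(premise: str, claim: str) -> bool:
    """Toy entailment stand-in: the cited text must contain the claim."""
    return claim.lower() in premise.lower()

def citation_scores(sentences, docs):
    """
    sentences: list of (claim_text, [cited_doc_ids])
    docs: dict mapping doc_id -> passage text
    Returns (citation_recall, citation_precision):
      - recall: fraction of sentences supported by the union of their citations
      - precision: fraction of individual citations that support their sentence
    """
    supported = 0
    total_cites = 0
    relevant_cites = 0
    for claim, cites in sentences:
        premise = " ".join(docs[c] for c in cites)
        if cites and entails(premise, claim):
            supported += 1
        for c in cites:
            total_cites += 1
            # Count a citation as relevant if its passage alone supports the claim.
            if entails(docs[c], claim):
                relevant_cites += 1
    recall = supported / len(sentences) if sentences else 0.0
    precision = relevant_cites / total_cites if total_cites else 0.0
    return recall, precision
```

Swapping the substring check for a real NLI classifier recovers the spirit of the benchmark's entailment-based scoring.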
Quick Start & Requirements
Install the core dependencies, then download the benchmark data:
pip install torch transformers accelerate openai
bash download_data.sh
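Each benchmark item pairs a question with retrieved passages that the model is expected to cite. The sketch below shows one simple way to number the top-k passages in a prompt so the model can cite them as [1], [2], and so on; the field names (`question`, `docs`, `title`, `text`) are assumptions about the data layout, not verified against the downloaded files.

```python
# Hypothetical prompt builder for citation-aware generation.
# Assumption: each data item looks like
#   {"question": "...", "docs": [{"title": "...", "text": "..."}, ...]}

def format_prompt(item, k=3):
    """Number the top-k passages so the model can cite them as [n]."""
    ctx = "\n".join(
        f"[{i + 1}] {d['title']}: {d['text']}"
        for i, d in enumerate(item["docs"][:k])
    )
    return f"{ctx}\n\nQuestion: {item['question']}\nAnswer (cite sources as [n]):"
```

This corresponds to the "vanilla" setting, where passages are placed directly in context; the summarized-context and snippet settings would preprocess `text` before formatting.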
Limitations & Caveats
The retrieval setup, particularly dense retrieval with GTR embeddings, is computationally expensive and requires substantial GPU memory and time. Building the BM25 index also involves downloading large files and configuring environment variables. The project is tested against pinned, older dependency versions, so running it in newer environments may require careful dependency management.