LongMemEval by xiaowu0162

Long-term memory benchmark for chat assistants

Created 1 year ago
257 stars

Top 98.4% on SourcePulse

Project Summary

LongMemEval provides a comprehensive, challenging, and scalable benchmark for evaluating the long-term interactive memory capabilities of chat assistants. Aimed at researchers and developers building advanced conversational AI, it offers a rigorous methodology to assess how well these systems retain and utilize information across extended dialogues, thereby improving their reliability and coherence in real-world applications.

How It Works

The benchmark employs an attribute-controlled pipeline to construct coherent, extensible, and timestamped chat histories for each question. Because every question is posed only after all interaction sessions have concluded, a chat system under test must process the interactions online and memorize relevant information as it arrives. The pipeline yields diverse, scalable test cases designed to probe five core long-term memory abilities: Information Extraction, Multi-Session Reasoning, Knowledge Updates, Temporal Reasoning, and Abstention.
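In code, the evaluation loop this implies is straightforward. The sketch below is illustrative only: MemoryAgent is a hypothetical interface, not part of the LongMemEval codebase, and the instance field names (haystack_dates, haystack_sessions, question_date, question) follow the schema described in the repository's README and should be checked against the released data.

```python
class MemoryAgent:
    """Any chat system under test: ingests sessions online, answers afterwards."""
    def observe(self, timestamp: str, session: list[dict]) -> None: ...
    def answer(self, question_date: str, question: str) -> str: ...

def run_instance(agent: MemoryAgent, instance: dict) -> str:
    # Each haystack session is a list of {"role": ..., "content": ...} turns,
    # paired with a timestamp; the agent sees the sessions strictly in order.
    for ts, session in zip(instance["haystack_dates"], instance["haystack_sessions"]):
        agent.observe(ts, session)  # online memorization phase
    # The question is posed only after every session has been ingested.
    return agent.answer(instance["question_date"], instance["question"])
```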

Quick Start & Requirements

The LongMemEval dataset is available on Hugging Face; the README provides wget commands that download the data into the data/ directory. Environment setup is recommended via Conda with Python 3.9. A requirements-lite.txt is available for evaluation-only setups. Running the memory systems described in the paper requires the full environment in requirements-full.txt, including torch==2.3.1 with the matching torchvision==0.18.1 and torchaudio==2.3.1 builds for CUDA 12.1, as tested on Linux.
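Once downloaded, a variant can be inspected with the standard library alone. A minimal sketch, assuming the data/ layout above; the question_type field name is an assumption from the dataset release and worth verifying against your copy:

```python
import json

# Load the "small" variant (~40 sessions per question) from the data/ directory.
with open("data/longmemeval_s.json") as f:
    instances = json.load(f)  # a list of question instances

print(len(instances))  # expected: 500 questions
example = instances[0]
print(example["question_type"], example["question"])
```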

Highlighted Details

  • Features 500 high-quality questions testing five distinct long-term memory abilities.
  • Includes three dataset variants: longmemeval_s.json (~40 sessions, ~115k tokens), longmemeval_m.json (~500 sessions), and longmemeval_oracle.json (with oracle retrieval).
  • Supports custom chat history compilation of arbitrary length to scale difficulty.
  • Provides code for memory retrieval experiments using various retrievers (BM25, Contriever, Stella, GTE) and index granularities (turn, session); a minimal BM25 sketch follows this list.
  • Includes retrieval-augmented generation experiments with options for different reading methods and history formats.
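To make the retrieval setup concrete, here is a hedged sketch of BM25 retrieval over a single instance at either index granularity. It uses the rank_bm25 package as a stand-in rather than the repository's own retrieval code, and the instance field names follow the schema assumed earlier.

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def build_corpus(instance: dict, granularity: str = "session") -> list[str]:
    """Flatten the haystack into retrieval units: whole sessions or single turns."""
    if granularity == "session":
        return ["\n".join(turn["content"] for turn in session)
                for session in instance["haystack_sessions"]]
    # Turn granularity: every individual message is its own indexable unit.
    return [turn["content"]
            for session in instance["haystack_sessions"] for turn in session]

def retrieve(instance: dict, granularity: str = "session", k: int = 5) -> list[str]:
    corpus = build_corpus(instance, granularity)
    bm25 = BM25Okapi([doc.lower().split() for doc in corpus])  # whitespace tokens
    query = instance["question"].lower().split()
    return bm25.get_top_n(query, corpus, n=k)  # top-k units to pass to a reader
```

In a retrieval-augmented run, the returned units would then be formatted into the reader's prompt, which is where the reading methods and history formats mentioned above come into play.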

Maintenance & Community

The project is by Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu; the accompanying paper was accepted at ICLR 2025. The benchmark was released in October 2024. The README lists no community channels (e.g., Discord, Slack) or roadmap links.

Licensing & Compatibility

The README does not explicitly state a license for the repository or the dataset. Users considering commercial use or integration into closed-source projects should confirm the licensing terms with the authors before adopting it.

Limitations & Caveats

The longmemeval_m.json variant is noted as too long for standard long-context testing. The provided environment setup was tested on Linux with CUDA 12.1; users on other platforms may need to adjust the requirements. Abstention instances are excluded from retrieval evaluation because the correct behavior is to decline to answer, so there is no ground-truth evidence to retrieve. The absence of a stated license is a significant caveat.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 2
  • Star history: 33 stars in the last 30 days
