xiaowu0162/LongMemEval
Long-term memory benchmark for chat assistants
LongMemEval provides a comprehensive, challenging, and scalable benchmark for evaluating the long-term interactive memory capabilities of chat assistants. Aimed at researchers and developers building advanced conversational AI, it offers a rigorous methodology for assessing how well these systems retain and use information across extended dialogues, helping teams improve reliability and coherence in real-world applications.
How It Works
The benchmark uses an attribute-controlled pipeline to construct coherent, extensible, timestamped chat histories for each question. Chat systems must parse these interactions online and memorize the relevant information as sessions arrive, then answer the question accurately after all sessions have concluded. The pipeline yields diverse, scalable test cases designed to probe five core long-term memory abilities: Information Extraction, Multi-Session Reasoning, Knowledge Updates, Temporal Reasoning, and Abstention.
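To make this concrete, here is a minimal sketch of what a single benchmark instance might look like in Python; every field name and value below is an illustrative assumption based on the description above, not the official schema.

    # Illustrative LongMemEval-style instance; field names and values are
    # assumptions for the sketch, not the official dataset schema.
    example_instance = {
        "question_id": "q_001",
        "question_type": "temporal-reasoning",  # one of the five abilities
        "question": "How long after adopting my cat did I start my new job?",
        "answer": "12 days",
        "question_date": "2023/06/01 (Thu) 10:00",  # asked after all sessions end
        "haystack_sessions": [
            {
                "date": "2023/05/02 (Tue) 09:15",  # each session is timestamped
                "turns": [
                    {"role": "user", "content": "I adopted a cat today!"},
                    {"role": "assistant", "content": "Congratulations!"},
                ],
            },
            # ... many more sessions; the system must memorize as it goes
        ],
    }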
Quick Start & Requirements
The LongMemEval dataset is available on Hugging Face; download it into the data/ directory using the wget commands provided in the README. Environment setup is recommended via Conda with Python 3.9. For evaluation-only setups, requirements-lite.txt suffices. Running the memory systems described in the paper requires the full environment from requirements-full.txt, which pins PyTorch packages (torch==2.3.1 with matching torchvision and torchaudio builds) for CUDA 12.1, as tested on Linux.
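After downloading, a quick sanity check in Python can confirm the files are in place. The file name comes from the variants listed under Highlighted Details below; treating the top-level JSON value as a list of question instances is an assumption.

    # Hedged sanity check: load the small variant and count instances.
    # Assumes data/ holds the downloaded files and the top level is a JSON list.
    import json

    with open("data/longmemeval_s.json") as f:
        instances = json.load(f)

    print(f"Loaded {len(instances)} question instances")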
Highlighted Details
Three dataset variants are provided: longmemeval_s.json (~40 sessions, ~115k tokens), longmemeval_m.json (~500 sessions), and longmemeval_oracle.json (with oracle retrieval).
Maintenance & Community
The benchmark is by Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu, and the accompanying paper was accepted at ICLR 2025. The benchmark was released in October 2024. The README lists no community channels (e.g., Discord, Slack) or roadmap links.
Licensing & Compatibility
The README does not state a license for either the repository or the dataset. Users considering commercial use or integration into closed-source projects should verify licensing terms before adopting the benchmark.
Limitations & Caveats
The longmemeval_m.json variant is noted as too long for standard long-context testing. The provided environment setup was tested on Linux with CUDA 12.1; users on other platforms may need to adjust the requirements. Abstention instances are excluded from retrieval evaluation, since there is no ground-truth evidence to retrieve for questions whose answer never appears in the history. The absence of a stated license is a significant caveat.
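As an illustration of that caveat, a retrieval evaluation loop would skip abstention instances before scoring. This sketch reuses the instances list from the sanity check above; the question_type field and the "abstention" marker are assumptions, not the dataset's documented convention.

    # Hedged sketch: drop abstention instances before retrieval evaluation.
    # The "question_type" field and "abstention" marker are assumptions.
    retrieval_pool = [
        inst for inst in instances
        if "abstention" not in inst.get("question_type", "")
    ]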