MemoryAgentBench by HUST-AI-HYZ

Evaluating LLM agents' memory through incremental interactions

Created 8 months ago
254 stars

Top 99.1% on SourcePulse

Project Summary

This project provides a standardized framework for evaluating the memory capabilities of Large Language Model (LLM) agents through incremental, multi-turn interactions. It targets researchers and developers building and assessing LLM agents, offering a more efficient benchmark design ("inject once, query multiple times") to assess agent performance in realistic conversational scenarios.

How It Works

The benchmark assesses agents on four core competencies: Accurate Retrieval (AR), Test-Time Learning (TTL), Long-Range Understanding (LRU), and Conflict Resolution (CR). It utilizes reformulated data from existing benchmarks and newly constructed datasets like EventQA and FactConsolidation. Data is segmented into chunks to simulate conversational flow, enabling a systematic evaluation of how agents manage and utilize information over extended interactions.
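The "inject once, query multiple times" design described above can be illustrated with a toy sketch. The chunking helper and the agent class here are hypothetical stand-ins, not the benchmark's actual code: conversational context is injected into the agent's memory chunk by chunk a single time, and many queries are then evaluated against that same accumulated memory state.

```python
def chunk(text, size):
    """Split a long context into fixed-size chunks to simulate conversational turns."""
    return [text[i:i + size] for i in range(0, len(text), size)]

class EchoMemoryAgent:
    """Toy stand-in for a memory agent: stores chunks, answers by substring lookup."""
    def __init__(self):
        self.memory = []

    def inject(self, chunk_text):
        # Incremental injection: each chunk arrives as one "turn".
        self.memory.append(chunk_text)

    def answer(self, query):
        # Naive retrieval: return the first stored chunk containing the query.
        for c in self.memory:
            if query in c:
                return c
        return None

agent = EchoMemoryAgent()
for c in chunk("alpha beta gamma delta", size=11):
    agent.inject(c)                                       # inject once...

answers = [agent.answer(q) for q in ("alpha", "gamma")]   # ...query multiple times
```

The efficiency gain comes from amortizing the (expensive) memory-construction phase over many queries, instead of re-injecting the full context for every question.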

Quick Start & Requirements

Setup involves creating a dedicated Conda environment (e.g., python=3.10.16) and installing dependencies via pip install torch, pip install -r requirements.txt, and pip install "numpy<2". Users must download processed data from HuggingFace (automatic download is possible) and configure API keys (OpenAI, Anthropic, Google, Cognee) in a .env file. Note that hipporag may cause version conflicts with newer OpenAI models, potentially requiring separate environments or manual package management for cognee and letta. Example evaluation commands for various agent types and LLM-based metric evaluations are provided. The project's paper is available as an arXiv preprint (arXiv:2507.05257).
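The setup steps above can be collected into a short script. The environment name and the exact key names in the .env file are illustrative; consult the project README for the actual values.

```shell
# Sketch of the setup described above; environment and key names are illustrative.
conda create -n memoryagentbench python=3.10.16 -y
conda activate memoryagentbench

pip install torch
pip install -r requirements.txt
pip install "numpy<2"

# API keys go in a .env file at the repo root (key names illustrative):
cat > .env <<'EOF'
OPENAI_API_KEY=...
ANTHROPIC_API_KEY=...
GOOGLE_API_KEY=...
EOF
```

If hipporag's pinned dependencies conflict with the packages needed for cognee or letta, creating a second Conda environment for those agents is a straightforward workaround.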

Highlighted Details

  • The project's paper has been accepted by ICLR 2026.
  • Evaluation focuses on four key memory competencies: AR, TTL, LRU, and CR.
  • Employs an efficient "inject once, query multiple times" data handling strategy.
  • Includes novel datasets: EventQA and FactConsolidation.
  • Supports LLM-based metric evaluation using GPT-4o as a judge.
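The LLM-as-judge evaluation mentioned in the last bullet can be sketched as follows. The prompt wording, the CORRECT/INCORRECT scale, and the function names are illustrative assumptions, not the benchmark's actual judge prompt; only the use of GPT-4o as the judging model comes from the source.

```python
def build_judge_prompt(question, reference, candidate):
    """Assemble a grading prompt asking the judge model to compare answers.

    The wording and binary scale here are illustrative, not the benchmark's.
    """
    return (
        "You are grading an agent's answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )

def judge(question, reference, candidate, model="gpt-4o"):
    """Score one answer with an LLM judge via the OpenAI Chat Completions API."""
    from openai import OpenAI  # imported lazily; reads OPENAI_API_KEY from the env
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": build_judge_prompt(question, reference, candidate)}],
    )
    return resp.choices[0].message.content.strip() == "CORRECT"
```

A binary judge like this is cheap to aggregate into an accuracy metric; graded rubrics (e.g., 1 to 5 scores) trade that simplicity for finer-grained signal.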

Maintenance & Community

Recent updates (January 2026) and ICLR 2026 paper acceptance indicate active development. Future plans include a public leaderboard website and a more modular framework for integrating custom memory agents. No direct community links (e.g., Discord, Slack) or social media handles are provided in the README.

Licensing & Compatibility

The software license is not explicitly stated in the README, preventing a clear assessment of compatibility for commercial use or integration into closed-source projects. Dependency versioning, particularly with hipporag and OpenAI models, may affect compatibility.

Limitations & Caveats

The primary adoption blocker is the unspecified software license. Potential dependency conflicts, especially with hipporag and OpenAI versions, may require complex environment management. Key features like a public leaderboard and a flexible agent integration framework are still under development.

Health Check

Last Commit: 1 month ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0
Star History: 30 stars in the last 30 days

Explore Similar Projects

Starred by Eric Zhu (coauthor of AutoGen; Research Scientist at Microsoft Research) and Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems").

ReMe by agentscope-ai (10.0%) — LLM chatbot framework for long-term memory. 2k stars; created 1 year ago; updated 1 day ago.