MemoryAgentBench by HUST-AI-HYZ

Evaluating LLM agents' memory through incremental interactions

Created 8 months ago
254 stars

Top 99.1% on SourcePulse

Project Summary

This project provides a standardized framework for evaluating the memory capabilities of Large Language Model (LLM) agents through incremental, multi-turn interactions. It targets researchers and developers building and assessing LLM agents, offering a more efficient benchmark design ("inject once, query multiple times") to assess agent performance in realistic conversational scenarios.

How It Works

The benchmark assesses agents on four core competencies: Accurate Retrieval (AR), Test-Time Learning (TTL), Long-Range Understanding (LRU), and Conflict Resolution (CR). It utilizes reformulated data from existing benchmarks and newly constructed datasets like EventQA and FactConsolidation. Data is segmented into chunks to simulate conversational flow, enabling a systematic evaluation of how agents manage and utilize information over extended interactions.
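The "inject once, query multiple times" design described above can be illustrated with a toy sketch. The chunking helper and the agent class here are hypothetical stand-ins, not the benchmark's actual code: conversational context is injected into the agent's memory chunk by chunk a single time, and many queries are then evaluated against that same accumulated memory state.

```python
def chunk(text, size):
    """Split a long context into fixed-size chunks to simulate conversational turns."""
    return [text[i:i + size] for i in range(0, len(text), size)]

class EchoMemoryAgent:
    """Toy stand-in for a memory agent: stores chunks, answers by substring lookup."""
    def __init__(self):
        self.memory = []

    def inject(self, chunk_text):
        # Incremental injection: each chunk arrives as one "turn".
        self.memory.append(chunk_text)

    def answer(self, query):
        # Naive retrieval: return the first stored chunk containing the query.
        for c in self.memory:
            if query in c:
                return c
        return None

agent = EchoMemoryAgent()
for c in chunk("alpha beta gamma delta", size=11):
    agent.inject(c)                                       # inject once...

answers = [agent.answer(q) for q in ("alpha", "gamma")]   # ...query multiple times
```

The efficiency gain comes from amortizing the (expensive) memory-construction phase over many queries, instead of re-injecting the full context for every question.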

Quick Start & Requirements

Setup involves creating a dedicated Conda environment (e.g., python=3.10.16) and installing dependencies via pip install torch, pip install -r requirements.txt, and pip install "numpy<2". Users must download processed data from HuggingFace (automatic download is possible) and configure API keys (OpenAI, Anthropic, Google, Cognee) in a .env file. Note that hipporag may cause version conflicts with newer OpenAI models, potentially requiring separate environments or manual package management for cognee and letta. Example evaluation commands for various agent types and LLM-based metric evaluations are provided. The project's paper is available as an arXiv preprint (arXiv:2507.05257).
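The setup steps above can be collected into a short script. The environment name and the exact key names in the .env file are illustrative; consult the project README for the actual values.

```shell
# Sketch of the setup described above; environment and key names are illustrative.
conda create -n memoryagentbench python=3.10.16 -y
conda activate memoryagentbench

pip install torch
pip install -r requirements.txt
pip install "numpy<2"

# API keys go in a .env file at the repo root (key names illustrative):
cat > .env <<'EOF'
OPENAI_API_KEY=...
ANTHROPIC_API_KEY=...
GOOGLE_API_KEY=...
EOF
```

If hipporag's pinned dependencies conflict with the packages needed for cognee or letta, creating a second Conda environment for those agents is a straightforward workaround.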

Highlighted Details

  • The project's paper has been accepted by ICLR 2026.
  • Evaluation focuses on four key memory competencies: AR, TTL, LRU, and CR.
  • Employs an efficient "inject once, query multiple times" data handling strategy.
  • Includes novel datasets: EventQA and FactConsolidation.
  • Supports LLM-based metric evaluation using GPT-4o as a judge.
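The LLM-as-judge evaluation mentioned in the last bullet can be sketched as follows. The prompt wording, the CORRECT/INCORRECT scale, and the function names are illustrative assumptions, not the benchmark's actual judge prompt; only the use of GPT-4o as the judging model comes from the source.

```python
def build_judge_prompt(question, reference, candidate):
    """Assemble a grading prompt asking the judge model to compare answers.

    The wording and binary scale here are illustrative, not the benchmark's.
    """
    return (
        "You are grading an agent's answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )

def judge(question, reference, candidate, model="gpt-4o"):
    """Score one answer with an LLM judge via the OpenAI Chat Completions API."""
    from openai import OpenAI  # imported lazily; reads OPENAI_API_KEY from the env
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": build_judge_prompt(question, reference, candidate)}],
    )
    return resp.choices[0].message.content.strip() == "CORRECT"
```

A binary judge like this is cheap to aggregate into an accuracy metric; graded rubrics (e.g., 1 to 5 scores) trade that simplicity for finer-grained signal.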

Maintenance & Community

Recent updates (January 2026) and ICLR 2026 paper acceptance indicate active development. Future plans include a public leaderboard website and a more modular framework for integrating custom memory agents. No direct community links (e.g., Discord, Slack) or social media handles are provided in the README.

Licensing & Compatibility

The software license is not explicitly stated in the README, preventing a clear assessment of compatibility for commercial use or integration into closed-source projects. Dependency versioning, particularly with hipporag and OpenAI models, may affect compatibility.

Limitations & Caveats

The primary adoption blocker is the unspecified software license. Potential dependency conflicts, especially with hipporag and OpenAI versions, may require complex environment management. Key features like a public leaderboard and a flexible agent integration framework are still under development.

Health Check

Last Commit: 1 month ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0
Star History: 30 stars in the last 30 days

Explore Similar Projects

Starred by Eric Zhu (coauthor of AutoGen; Research Scientist at Microsoft Research) and Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems").

ReMe by agentscope-ai (10.0%) — LLM chatbot framework for long-term memory. 2k stars; created 1 year ago; updated 1 day ago.