HUST-AI-HYZ: Evaluating LLM agents' memory through incremental interactions
Top 99.1% on SourcePulse
This project provides a standardized framework for evaluating the memory capabilities of Large Language Model (LLM) agents through incremental, multi-turn interactions. It targets researchers and developers building and assessing LLM agents, offering a more efficient benchmark design ("inject once, query multiple times") to assess agent performance in realistic conversational scenarios.
How It Works
The benchmark assesses agents on four core competencies: Accurate Retrieval (AR), Test-Time Learning (TTL), Long-Range Understanding (LRU), and Conflict Resolution (CR). It utilizes reformulated data from existing benchmarks and newly constructed datasets like EventQA and FactConsolidation. Data is segmented into chunks to simulate conversational flow, enabling a systematic evaluation of how agents manage and utilize information over extended interactions.
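The chunked, "inject once, query multiple times" flow described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `chunk_text`, `ToyMemoryAgent`, and `evaluate` are hypothetical names, the word-overlap lookup stands in for a real memory agent, and substring matching stands in for the benchmark's actual metrics. It is not the project's implementation.

```python
from dataclasses import dataclass, field


def chunk_text(text: str, max_words: int) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


@dataclass
class ToyMemoryAgent:
    """Hypothetical stand-in for a memory agent: stores every chunk verbatim."""
    memory: list[str] = field(default_factory=list)

    def observe(self, chunk: str) -> None:
        # One "conversation turn": the agent receives a chunk and memorizes it.
        self.memory.append(chunk)

    def answer(self, question: str) -> str:
        # Return the stored chunk that shares the most words with the question.
        q_words = set(question.lower().split())
        return max(
            self.memory,
            key=lambda c: len(q_words & set(c.lower().split())),
            default="",
        )


def evaluate(agent, context: str, qa_pairs, max_words: int = 8) -> float:
    """Inject the context once, chunk by chunk, then ask every question."""
    for chunk in chunk_text(context, max_words):
        agent.observe(chunk)
    hits = sum(
        expected.lower() in agent.answer(question).lower()
        for question, expected in qa_pairs
    )
    return hits / len(qa_pairs)


# Usage: one injection pass, several queries against the same memory.
context = "Alice moved to Paris in 2019. Bob adopted a cat named Miso."
questions = [("Where did Alice move to", "Paris"), ("What is the cat named", "Miso")]
print(evaluate(ToyMemoryAgent(), context, questions))
```

The point of the design is that the cost of feeding the context is paid once per agent, while each additional question only costs one query.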
Quick Start & Requirements
Setup involves creating a dedicated Conda environment (e.g., python=3.10.16) and installing dependencies with three commands: pip install torch, pip install -r requirements.txt, and pip install "numpy<2". Users must download the processed data from HuggingFace (automatic download is possible) and configure API keys (OpenAI, Anthropic, Google, Cognee) in a .env file. Note that hipporag may cause version conflicts with newer OpenAI models, and cognee and letta may need separate environments or manual package management. The README provides example evaluation commands for various agent types and for LLM-based metric evaluation. The project's paper is available as an arXiv preprint (arXiv:2507.05257).
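Before running an evaluation, it can help to confirm that the .env file actually contains the keys for the four providers. The sketch below uses only the standard library. The variable names in `ASSUMED_KEYS` are assumptions, not names confirmed by the project, and `parse_env` and `missing_keys` are hypothetical helpers; check the README for the names the code really reads.

```python
from pathlib import Path

# Assumed variable names for illustration; the project may use different ones.
ASSUMED_KEYS = (
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "GOOGLE_API_KEY",
    "COGNEE_API_KEY",
)


def parse_env(text: str) -> dict[str, str]:
    """Parse simple NAME=value lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[name.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(env_text: str, required=ASSUMED_KEYS) -> list[str]:
    """Return the required names that are absent or empty in the .env text."""
    values = parse_env(env_text)
    return [name for name in required if not values.get(name)]


if __name__ == "__main__":
    env_path = Path(".env")
    text = env_path.read_text() if env_path.exists() else ""
    print("missing:", missing_keys(text))
```

This parser handles only plain NAME=value lines; a dedicated .env loader is a better choice if the file uses export statements or multi-line values.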
Maintenance & Community
Recent updates (January 2026) and ICLR 2026 paper acceptance indicate active development. Future plans include a public leaderboard website and a more modular framework for integrating custom memory agents. No direct community links (e.g., Discord, Slack) or social media handles are provided in the README.
Licensing & Compatibility
The software license is not explicitly stated in the README, preventing a clear assessment of compatibility for commercial use or integration into closed-source projects. Dependency versioning, particularly with hipporag and OpenAI models, may affect compatibility.
Limitations & Caveats
The primary adoption blocker is the unspecified software license. Potential dependency conflicts, especially with hipporag and OpenAI versions, may require complex environment management. Key features like a public leaderboard and a flexible agent integration framework are still under development.