memorybench by supermemoryai

Evaluate conversational memory and RAG systems with a unified benchmarking framework

Created 7 months ago · 255 stars · Top 98.7% on SourcePulse

Project Summary

MemoryBench is a unified, pluggable benchmarking framework for evaluating conversational memory and Retrieval-Augmented Generation (RAG) systems. It targets engineers and researchers who need to rigorously assess LLM context management across diverse datasets and providers. Because its components interoperate, users can mix and match benchmarks, memory providers, and LLM judges, enabling direct, side-by-side comparisons and detailed performance analysis.

How It Works

The core of MemoryBench is a modular pipeline with Ingest, Index, Search, Answer, Evaluate, and Report stages. Its pluggable architecture lets custom benchmarks (e.g., LoCoMo, LongMem) and memory providers (e.g., Supermem, Mem0, Zep) slot in without modifying the core pipeline. The system is judge-agnostic, supporting various LLMs (GPT-4o, Claude, Gemini) for evaluation. Key advantages include checkpointed runs for resilience, multi-provider comparison, and structured reporting built around a novel MemScore metric that combines accuracy, latency, and token usage to capture nuanced performance trade-offs.
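
As a mental model only, a provider plugin might expose an ingest/search contract along the following lines. The interface and class below are a hypothetical TypeScript sketch, not memorybench's actual API:

```typescript
// Hypothetical sketch of a pluggable memory-provider contract. The names and
// signatures below are illustrative only, not memorybench's actual API.

interface MemoryProvider {
  // Ingest raw conversation turns into the provider's store.
  ingest(sessionId: string, messages: string[]): Promise<void>;
  // Return the stored snippets most relevant to a query.
  search(sessionId: string, query: string, topK: number): Promise<string[]>;
}

// Toy stand-in for a real provider (Supermem, Mem0, Zep): naive keyword overlap.
class KeywordProvider implements MemoryProvider {
  private store = new Map<string, string[]>();

  async ingest(sessionId: string, messages: string[]): Promise<void> {
    this.store.set(sessionId, (this.store.get(sessionId) ?? []).concat(messages));
  }

  async search(sessionId: string, query: string, topK: number): Promise<string[]> {
    const terms = new Set(query.toLowerCase().split(/\s+/));
    return (this.store.get(sessionId) ?? [])
      .map((msg) => ({
        msg,
        hits: msg.toLowerCase().split(/\s+/).filter((w) => terms.has(w)).length,
      }))
      .sort((a, b) => b.hits - a.hits)
      .slice(0, topK)
      .map((s) => s.msg);
  }
}
```

With a contract like this, the Ingest and Search stages can drive any provider uniformly, which is what makes side-by-side comparison possible.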

Quick Start & Requirements

Installation involves cloning the repository and running bun install. API keys for the desired providers and judges are configured by copying .env.example to .env.local. The primary command is bun run src/index.ts run -p <provider> -b <benchmark>. Prerequisites are the Bun runtime and API credentials for services such as OpenAI, Anthropic, and Google.
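
Put together, a first run might look like the following shell session. The clone URL is inferred from the project name; the install, env, and run commands are taken from the summary above:

```sh
git clone https://github.com/supermemoryai/memorybench   # URL inferred from the project name
cd memorybench
bun install
cp .env.example .env.local    # then fill in provider/judge API keys
bun run src/index.ts run -p <provider> -b <benchmark>
```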

Highlighted Details

  • Interoperability: Easily integrate custom benchmarks, providers, and judges.
  • Checkpointing: Pipeline stages checkpoint independently, allowing runs to resume from failures.
  • Multi-Provider Comparison: Run benchmarks across multiple memory systems simultaneously for direct evaluation.
  • Judge-Agnostic: Swap evaluation LLMs without altering core benchmark logic.
  • Web UI: An interactive interface provides real-time inspection of runs, questions, and failures.
  • MemScore Metric: A composite score combining accuracy, latency, and token usage for a multi-dimensional view of performance (see the sketch after this list).
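
The exact MemScore formula is not given in the README summary. Purely as an illustration of a composite accuracy/latency/tokens score, one could blend the three dimensions with invented weights and budgets:

```typescript
// Illustrative only: the actual MemScore formula is not documented here.
interface RunMetrics {
  accuracy: number;    // fraction of questions judged correct, 0..1
  latencyMs: number;   // mean end-to-end latency per question
  totalTokens: number; // tokens consumed across the run
}

// Invented weights and budgets for the sake of the example.
function compositeScore(
  m: RunMetrics,
  budget = { latencyMs: 2000, totalTokens: 100_000 },
): number {
  const latencyPenalty = Math.min(m.latencyMs / budget.latencyMs, 1);
  const tokenPenalty = Math.min(m.totalTokens / budget.totalTokens, 1);
  // Accuracy dominates; latency and token cost each shave off up to 20%.
  return 0.6 * m.accuracy + 0.2 * (1 - latencyPenalty) + 0.2 * (1 - tokenPenalty);
}

console.log(compositeScore({ accuracy: 0.82, latencyMs: 900, totalTokens: 40_000 }));
// 0.6*0.82 + 0.2*0.55 + 0.2*0.6 = 0.722
```

Folding latency and tokens into one number is what lets a slower-but-cheaper provider be ranked against a faster-but-costlier one; memorybench's real weighting may differ.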

Maintenance & Community

The README does not list maintainers, community channels (e.g., Discord, Slack), or a project roadmap.

Licensing & Compatibility

The project is released under the MIT license, which permits broad usage, including commercial applications and integration into closed-source systems.

Limitations & Caveats

The framework's extensibility relies on user contributions for new providers, benchmarks, or judges. Performance and reliability may vary based on the specific implementations of these pluggable components. Setting up the necessary API keys for various LLM services is a prerequisite for execution. The README does not specify alpha/beta status or known bugs.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 18 stars in the last 30 days
