garrytan
AI agent memory evaluation suite
Summary
This repository provides a transparent, reproducible, and comprehensive test suite for gbrain, an AI agent's long-term memory system. It targets engineers and researchers evaluating gbrain for adoption, offering detailed benchmarks on recall, precision, consistency, and speed at scale. The primary benefit is enabling users to verify gbrain's performance and reliability independently, rather than relying on vendor claims.
How It Works
Benchmarks comprise a realistic corpus, questions with sealed answers hidden from the system under test, and plain-English metrics: Recall ("was the right thing returned?") and Precision ("how much of what was returned was relevant?"). The approach emphasizes balancing these metrics for real-world agent needs, publishing both strong and weak results. Anti-gaming measures include sealed answer keys, tolerance bands, pinned judge versions, and randomized question order to ensure test integrity.
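The two metrics above can be made concrete with a small sketch. This is not the repository's scoring code: the function name `scoreRetrieval`, the `Scores` type, and the use of string IDs for memory items are all assumptions made for illustration. It only shows how recall and precision differ for a single question.

```typescript
// Illustrative sketch only; names and types are hypothetical,
// not taken from the gbrain-evals source.
type Scores = { recall: number; precision: number };

// returned: item IDs the memory system gave back for one question.
// expected: item IDs listed in the sealed answer key for that question.
function scoreRetrieval(returned: string[], expected: string[]): Scores {
  const expectedSet = new Set(expected);
  const returnedSet = new Set(returned); // count duplicate results once
  const hits = Array.from(returnedSet).filter((id) => expectedSet.has(id)).length;
  return {
    // Recall: "was the right thing returned?"
    // Convention chosen here: 0 when the answer key is empty.
    recall: expectedSet.size === 0 ? 0 : hits / expectedSet.size,
    // Precision: "how much of what was returned was relevant?"
    // Convention chosen here: 0 when nothing was returned.
    precision: returnedSet.size === 0 ? 0 : hits / returnedSet.size,
  };
}

// Four items returned, one of the two expected items among them.
const example = scoreRetrieval(["a", "b", "c", "d"], ["a", "e"]);
console.log(example); // { recall: 0.5, precision: 0.25 }
```

Returning more items tends to raise recall and lower precision, which is why the two numbers are reported together rather than in isolation.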
Quick Start & Requirements
Install: cd gbrain-evals, then bun install.
Requirements: bun runtime, OPENAI_API_KEY, ANTHROPIC_API_KEY (optional, for specific variants), and a downloaded copy of the longmemeval_s.json dataset.
Highlighted Details
Maintenance & Community
No specific details on maintainers, community channels (like Discord/Slack), or a public roadmap are provided in the README. Development appears active, with gbrain itself being the reference system under test.
Licensing & Compatibility
The repository and its fictional corpora are licensed under MIT. Vendored precision-test artifacts are also MIT. This license permits commercial use and integration into closed-source projects.
Limitations & Caveats
With the default configuration, gbrain scores low precision (0.076) on the precision-specific benchmarks. The result is published deliberately, as an acknowledged trade-off, rather than omitted. Certain benchmark variants require API keys and therefore incur usage costs. The project uses bun as its primary runtime.