gbrain-evals by garrytan

AI agent memory evaluation suite

Created 2 months ago

275 stars

Top 93.8% on SourcePulse

Project Summary

This repository provides a transparent, reproducible, and comprehensive test suite for gbrain, a long-term memory system for AI agents. It targets engineers and researchers evaluating gbrain for adoption, with detailed benchmarks on recall, precision, consistency, and speed at scale. The primary benefit is that users can verify gbrain's performance and reliability independently rather than relying on vendor claims.

How It Works

Benchmarks comprise a realistic corpus, questions with sealed answers hidden from the system under test, and plain-English metrics: Recall ("was the right thing returned?") and Precision ("how much of what was returned was relevant?"). The approach emphasizes balancing these metrics for real-world agent needs, publishing both strong and weak results. Anti-gaming measures include sealed answer keys, tolerance bands, pinned judge versions, and randomized question order to ensure test integrity.
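The two metrics can be illustrated with a minimal sketch. The function names and the exact definitions here are assumptions for illustration, not code from gbrain-evals; the suite may, for example, score recall as a per-question hit rate rather than a fraction of relevant items. Result IDs are assumed to be unique.

```typescript
// Hypothetical helpers; not taken from the gbrain-evals codebase.

/** Recall@k: fraction of the relevant items that appear in the top-k results. */
function recallAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0;
  const topK = new Set(retrieved.slice(0, k));
  let hits = 0;
  relevant.forEach((id) => {
    if (topK.has(id)) hits++;
  });
  return hits / relevant.size;
}

/** Precision@k: fraction of the top-k results that are relevant. */
function precisionAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const topK = retrieved.slice(0, k);
  if (topK.length === 0) return 0;
  const hits = topK.filter((id) => relevant.has(id)).length;
  return hits / topK.length;
}

// Made-up example: five results returned, three items actually relevant.
const retrieved = ["a", "b", "c", "d", "e"];
const relevant = new Set(["a", "c", "z"]);

const recall = recallAtK(retrieved, relevant, 5);       // 2 of 3 relevant items found
const precision = precisionAtK(retrieved, relevant, 5); // 2 of 5 results relevant
```

Returning more results tends to raise recall and lower precision, which is why the two numbers are reported together.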

Quick Start & Requirements

Install: Clone the repo, cd gbrain-evals, then bun install.
Prerequisites: the bun runtime, an OPENAI_API_KEY, an ANTHROPIC_API_KEY (optional, needed only for specific variants), and a downloaded copy of the longmemeval_s.json dataset.
Resource Footprint: Initial runs incur ~$2 for embeddings; subsequent runs are cached and nearly free. The full suite takes approximately 15 minutes.
Links: Repository: https://github.com/garrytan/gbrain-evals.git. LongMemEval dataset: https://huggingface.co/datasets/xiaowu0162/longmemeval/resolve/main/longmemeval_s.
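Before the first (paid) run, it may help to confirm the prerequisites are in place. The sketch below is a hypothetical preflight helper, not part of gbrain-evals; only the variable names and the dataset filename come from the list above, and where the suite expects the dataset file to live is an assumption left to the caller.

```typescript
// Hypothetical preflight check; not part of the gbrain-evals codebase.
type Env = Record<string, string | undefined>;

interface PreflightReport {
  missing: string[];  // required items that are absent
  warnings: string[]; // optional items that are absent
}

function checkPrerequisites(env: Env, datasetPresent: boolean): PreflightReport {
  const missing: string[] = [];
  const warnings: string[] = [];

  // Required for the embedding-based benchmarks.
  if (!env["OPENAI_API_KEY"]) missing.push("OPENAI_API_KEY");

  // The LongMemEval dataset must be downloaded separately.
  if (!datasetPresent) missing.push("longmemeval_s.json");

  // Optional: only some benchmark variants need it.
  if (!env["ANTHROPIC_API_KEY"]) {
    warnings.push("ANTHROPIC_API_KEY not set; variants that need it cannot run");
  }

  return { missing, warnings };
}

// Example with made-up values: key present, dataset not yet downloaded.
const report = checkPrerequisites({ OPENAI_API_KEY: "sk-example" }, false);
```

In a real script the arguments would typically be process.env and the result of a file-existence check on the dataset path (for example existsSync from node:fs, which bun supports).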

Highlighted Details

Achieves 97.6% recall@5 on the public LongMemEval dataset, claimed as the best published score without an LLM in the retrieval loop.
Demonstrates 97.9% recall@5 and 49.1% precision@5 on relational questions, outperforming plain vector search by 38 precision points.
Reports zero regressions across 20 releases.
Offers an opt-in setting for precision-focused tasks, achieving 0.582 precision at a third of the latency of comparable systems.
Tests cover multi-modal ingestion (PDF, audio, HTML) and agent skill optimization.
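A claim like "zero regressions", combined with the tolerance bands mentioned earlier, can be pictured as a per-metric comparison against a stored baseline. The sketch below is an assumption about how such a check might look, not the suite's actual logic; the tolerance value and the "current" scores are invented for illustration, while the baselines reuse the scores quoted above.

```typescript
// Hypothetical regression check; not taken from the gbrain-evals codebase.
interface MetricResult {
  name: string;
  baseline: number; // score recorded for the previous release
  current: number;  // score from the run under review
}

/** Returns the names of metrics that fell below baseline by more than the tolerance. */
function findRegressions(results: MetricResult[], tolerance: number): string[] {
  return results
    .filter((r) => r.current < r.baseline - tolerance)
    .map((r) => r.name);
}

// Baselines are the published scores; the current values are made up.
const regressions = findRegressions(
  [
    { name: "recall@5", baseline: 0.976, current: 0.970 },    // within the band
    { name: "precision@5", baseline: 0.491, current: 0.450 }, // outside the band
  ],
  0.01, // assumed tolerance band of one point
);
// regressions is ["precision@5"]
```

A band of this kind keeps small run-to-run noise from being flagged while still catching a real drop.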

Maintenance & Community

No specific details on maintainers, community channels (like Discord/Slack), or a public roadmap are provided in the README. Development appears active, with gbrain itself being the reference system under test.

Licensing & Compatibility

The repository and its fictional corpora are licensed under MIT. Vendored precision-test artifacts are also MIT. This license permits commercial use and integration into closed-source projects.

Limitations & Caveats

The default configuration shows low precision (0.076) on precision-specific benchmarks, a trade-off the project publishes deliberately rather than hiding the weakness. Certain benchmark variants require API keys and therefore incur costs. The project uses bun as its primary runtime.
