gbrain-evals  by garrytan

AI agent memory evaluation suite

Created 2 months ago
275 stars

Top 93.8% on SourcePulse

Summary

This repository provides a transparent, reproducible test suite for gbrain, a long-term memory system for AI agents. It targets engineers and researchers evaluating gbrain for adoption, with detailed benchmarks on recall, precision, consistency, and speed at scale. The main benefit is that users can verify gbrain's performance and reliability independently instead of relying on vendor claims.

How It Works

Each benchmark consists of a realistic corpus, a set of questions whose answers are sealed and hidden from the system under test, and plain-English metrics: Recall ("was the right thing returned?") and Precision ("how much of what was returned was relevant?"). The suite emphasizes balancing the two metrics for real-world agent needs and publishes both strong and weak results. Anti-gaming measures include sealed answer keys, tolerance bands, pinned judge versions, and randomized question order.
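The two metrics can be made concrete with a minimal sketch. The function names and memory IDs below are hypothetical and are not taken from the gbrain-evals code. The sketch assumes recall@k is the share of relevant items found in the top k results and precision@k is the share of the returned top k results that are relevant. Other definitions exist (for example, dividing precision by k itself, or scoring recall as a hit/miss), and the suite may use one of those.

```typescript
// Hypothetical helpers for illustration; not the suite's actual scoring code.

/** Share of the relevant items that appear in the top-k results. */
function recallAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0;
  const topK = new Set(retrieved.slice(0, k));
  let hits = 0;
  for (const id of relevant) {
    if (topK.has(id)) hits++;
  }
  return hits / relevant.size;
}

/** Share of the returned top-k results that are relevant. */
function precisionAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const topK = retrieved.slice(0, k);
  if (topK.length === 0) return 0;
  const hits = topK.filter((id) => relevant.has(id)).length;
  return hits / topK.length;
}

// Made-up example: five results returned, two of them relevant.
const retrieved = ["m1", "m7", "m3", "m9", "m4"];
const relevant = new Set(["m1", "m3"]);

console.log(recallAtK(retrieved, relevant, 5));    // 1   (both relevant items were found)
console.log(precisionAtK(retrieved, relevant, 5)); // 0.4 (2 of 5 results were relevant)
```

The example shows how a system can score perfect recall while most of what it returns is noise, which is why the two numbers are reported together.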

Quick Start & Requirements

Highlighted Details

  • Achieves 97.6% recall@5 on the public LongMemEval dataset, claimed as the best published score without an LLM in the retrieval loop.
  • Demonstrates 97.9% recall@5 and 49.1% precision@5 on relational questions, outperforming plain vector search by 38 precision points.
  • Maintains zero regression across 20 releases, ensuring stability.
  • Offers an opt-in setting for precision-focused tasks, achieving 0.582 precision at a third of the latency of comparable systems.
  • Tests cover multi-modal ingestion (PDF, audio, HTML) and agent skill optimization.
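A "zero regression" claim is usually checked by comparing each release's metrics against a stored baseline, with a tolerance band so that small run-to-run noise does not count as a failure. The following is a minimal sketch of that idea, not the suite's actual mechanism: the function name, the tolerance of 0.005, and the "current" values are all illustrative assumptions. Only the baseline figures (0.976 recall@5, 0.491 precision@5) come from the highlights above. It assumes higher is better for every metric.

```typescript
// Hypothetical regression check; names, tolerance, and "current" values are assumptions.
interface MetricCheck {
  metric: string;
  baseline: number;
  current: number;
  regressed: boolean;
}

function checkRegressions(
  baseline: Record<string, number>,
  current: Record<string, number>,
  tolerance: number,
): MetricCheck[] {
  return Object.keys(baseline).map((metric) => {
    const base = baseline[metric];
    // A metric missing from the current run is treated as a drop to zero.
    const cur = current[metric] ?? 0;
    // Only a drop larger than the tolerance band counts as a regression.
    return { metric, baseline: base, current: cur, regressed: cur < base - tolerance };
  });
}

const baselineScores = { "recall@5": 0.976, "precision@5": 0.491 };
const currentScores = { "recall@5": 0.974, "precision@5": 0.47 }; // made-up release results

for (const check of checkRegressions(baselineScores, currentScores, 0.005)) {
  console.log(`${check.metric}: ${check.regressed ? "REGRESSED" : "ok"}`);
}
```

In this made-up run, the recall dip of 0.002 stays inside the band, while the precision drop of 0.021 would be flagged.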

Maintenance & Community

The README does not name maintainers, community channels (such as Discord or Slack), or a public roadmap. Development appears active, and gbrain itself serves as the reference system under test.

Licensing & Compatibility

The repository and its fictional corpora are licensed under MIT. Vendored precision-test artifacts are also MIT. This license permits commercial use and integration into closed-source projects.

Limitations & Caveats

The default configuration scores low precision (0.076) on the precision-specific benchmarks, a weak result the project publishes deliberately rather than hiding. Some benchmark variants require API keys and therefore incur costs. The project uses bun as its primary runtime.

Health Check

Last Commit: 2 weeks ago
Responsiveness: Inactive
Pull Requests (30d): 3
Issues (30d): 0
Star History: 89 stars in the last 30 days
