awesome-evals by benchflow-ai

AI agent evaluation: Curated resources

Created 3 days ago

472 stars

Top 63.9% on SourcePulse

Project Summary

This curated, opinionated list helps engineers, researchers, and practitioners navigate the fast-moving field of AI agent evaluation. It collects verified, concisely annotated papers, blogs, talks, tools, and benchmarks, so readers spend less time hunting for high-quality resources on building and evaluating AI agents.

How It Works

This resource distinguishes itself from typical "awesome" lists through rigorous curation and verification. Its methodology includes a depth-4 recursive citation crawl of academic papers, targeted discovery of industry sources missed by citation graphs, transcription and deep-noting of talks and podcasts, and adversarial verification to identify gaps. Every entry is annotated with its purpose and relevance, URLs are checked, and dead or abandoned tools are pruned, ensuring a high signal-to-noise ratio.
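The depth-limited citation crawl mentioned above can be illustrated with a minimal sketch. This is not BenchFlow's actual tooling: the function name `crawl_citations`, the `get_references` callback, and the toy graph are all hypothetical, and the sketch uses an iterative breadth-first traversal in place of literal recursion. It only shows how a crawl can follow references outward from seed papers and stop after four hops.

```python
from collections import deque


def crawl_citations(seeds, get_references, max_depth=4):
    """Breadth-first crawl of a citation graph, stopping at max_depth hops.

    `get_references` is an assumed callback that returns the IDs a paper cites.
    Returns a dict mapping each discovered paper ID to its hop distance.
    """
    depth_of = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        paper = queue.popleft()
        depth = depth_of[paper]
        if depth == max_depth:
            continue  # papers at the depth limit are recorded but not expanded
        for ref in get_references(paper):
            if ref not in depth_of:  # skip papers already discovered
                depth_of[ref] = depth + 1
                queue.append(ref)
    return depth_of


# Toy citation graph with hypothetical paper IDs; F cites A, forming a cycle.
TOY_GRAPH = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E"],
    "E": ["F"],
    "F": ["A"],
}

found = crawl_citations(["A"], lambda p: TOY_GRAPH.get(p, []), max_depth=4)
print(found)
```

Tracking the hop distance per paper keeps the crawl from revisiting entries when the citation graph contains cycles, and makes it easy to see which results came from the outer edge of the crawl.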

Quick Start & Requirements

This repository is a curated list of resources and does not require installation or specific prerequisites. It serves as a knowledge base rather than a runnable tool.

Highlighted Details

  • Features over 443 curated links, including 146 deep reading notes.
  • Employs a "non-BS" approach, verifying all entries and pruning outdated or irrelevant content.
  • Uses citation graph analysis and adversarial verification to broaden coverage.
  • Includes markers for new (2025–2026) or potentially problematic entries (⚠️).

Maintenance & Community

The library is maintained by BenchFlow. Contributions are welcomed, with guidelines provided in a CONTRIBUTING.md file.

Licensing & Compatibility

To the extent possible under law, BenchFlow and contributors have waived all copyright and related rights to this work under CC0 1.0. Linked resources remain under their respective licenses. The public-domain dedication allows the curated information to be reused freely.

Limitations & Caveats

Although the list is actively curated, entries marked with ⚠️ may point to URLs that have not been verified. The README also details common problems in the AI evaluation field itself, such as benchmark contamination, label errors, and leaderboard gaming, which gives readers critical context for judging the referenced tools and papers.

Health Check

  • Last Commit: 20 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 11
  • Issues (30d): 2
  • Star History: 476 stars in the last 3 days
