llm-as-a-verifier  by llm-as-a-verifier

A framework for fine-grained LLM verification and trajectory reward modeling

Created 1 month ago
394 stars

Top 72.8% on SourcePulse

Project Summary

LLM-as-a-Verifier is a general-purpose framework designed to provide fine-grained feedback for evaluating LLM-generated trajectories. It addresses the limitations of single-score evaluations by incorporating scoring granularity, repeated verification, and criteria decomposition, enabling more nuanced assessment. This framework is beneficial for researchers and engineers seeking to rigorously benchmark and improve LLM agent performance on complex tasks.

How It Works

The core of LLM-as-a-Verifier is its reward approximation formula, which quantifies a trajectory's reward $R(t, \tau)$ by averaging the probabilities that a model $\theta$ assigns to granular score tokens ($G$), across multiple evaluation criteria ($C$) and repeated verifications ($K$). Because the reward uses the full distribution over score tokens rather than collapsing it to a single discrete score, differences between candidates that a single score would hide are preserved. Candidate trajectories are then compared pairwise in a round-robin tournament, and the trajectory with the most wins is selected as the best.
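
A minimal sketch of the scoring and selection steps described above, under stated assumptions. The summary does not spell out the exact formula, so this assumes that each verifier call yields one probability per score token, that each token maps to a numeric value (the `token_values` list is hypothetical), and that the reward is the probability-weighted value averaged over all criteria and repeats. Renormalizing the probabilities is also an assumption, and the `compare` callable is a hypothetical stand-in for one pairwise verification. None of these names come from the repository.

```python
from itertools import combinations
from typing import Callable, Dict, Sequence, Tuple


def expected_score(token_probs: Sequence[float], token_values: Sequence[float]) -> float:
    """Probability-weighted value over the G score tokens of one verifier call.

    Assumption: probabilities are renormalized, since extracted logprobs
    may not cover the full vocabulary and so may not sum to 1.
    """
    if len(token_probs) != len(token_values):
        raise ValueError("need one probability per score token")
    total = sum(token_probs)
    if total <= 0:
        raise ValueError("score-token probabilities must have positive mass")
    return sum(p * v for p, v in zip(token_probs, token_values)) / total


def trajectory_reward(
    probs: Sequence[Sequence[Sequence[float]]],
    token_values: Sequence[float],
) -> float:
    """Average expected score over C criteria and K repeated verifications.

    probs[c][k] holds the G token probabilities for criterion c, repeat k.
    """
    scores = [expected_score(rep, token_values) for crit in probs for rep in crit]
    if not scores:
        raise ValueError("need at least one verification")
    return sum(scores) / len(scores)


def round_robin_winner(
    candidates: Sequence[str],
    compare: Callable[[str, str], Tuple[float, float]],
) -> str:
    """Compare every pair once and return the candidate with the most wins.

    compare(a, b) is a hypothetical callable returning the rewards of a and b
    from one pairwise verification. Ties award no win; if several candidates
    share the top win count, the earliest one in `candidates` is returned.
    """
    wins: Dict[str, int] = {name: 0 for name in candidates}
    for a, b in combinations(candidates, 2):
        reward_a, reward_b = compare(a, b)
        if reward_a > reward_b:
            wins[a] += 1
        elif reward_b > reward_a:
            wins[b] += 1
    return max(candidates, key=lambda name: wins[name])
```

In this sketch a round-robin over N candidates makes N(N-1)/2 calls to `compare`, so the cost of selection grows quadratically with the number of candidate trajectories.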

Quick Start & Requirements

  • Installation: pip install google-genai tqdm
  • Prerequisites: A .env file containing a VERTEX_API_KEY is required for logprob extraction.
  • Data: Pre-collected trajectory data for Terminal-Bench 2.0 and SWE-bench Verified must be placed in data/terminal_trajs/ and data/swebench_verified_trajs/ respectively.
  • Scripts: The evaluation scripts run_terminal_bench.py and run_swe_bench.py are provided.
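
The prerequisites above can be checked before running either script. The following is a hypothetical preflight helper, not part of the repository: it assumes the .env file uses plain `KEY=value` lines and that the data directories are relative to the project root. The names `read_env_keys` and `check_setup` are made up for illustration.

```python
from pathlib import Path
from typing import List, Set

# Directory names taken from the requirements listed above.
REQUIRED_DIRS = ("data/terminal_trajs", "data/swebench_verified_trajs")


def read_env_keys(env_path: Path) -> Set[str]:
    """Return the keys that have a non-empty value in a simple KEY=value file."""
    keys: Set[str] = set()
    if not env_path.is_file():
        return keys
    for line in env_path.read_text().splitlines():
        line = line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if value.strip():
            keys.add(key.strip())
    return keys


def check_setup(root: str = ".") -> List[str]:
    """List missing prerequisites; an empty list means the setup looks complete."""
    base = Path(root)
    problems: List[str] = []
    if "VERTEX_API_KEY" not in read_env_keys(base / ".env"):
        problems.append("VERTEX_API_KEY missing from .env")
    for rel in REQUIRED_DIRS:
        if not (base / rel).is_dir():
            problems.append(f"missing directory: {rel}")
    return problems
```

Calling `check_setup()` from the project root returns a list of human-readable problems. The check only confirms that the key and directories exist, not that the key is valid or that the trajectory data is complete.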

Highlighted Details

  • Achieves state-of-the-art performance on Terminal-Bench 2.0 (86.4%) and SWE-Bench Verified (77.8%) when used as a trajectory reward model.
  • Improves on the standard Pass@1 baseline on both benchmarks: 86.4% vs 81.8% on Terminal-Bench 2.0, and 77.8% vs 76.1% on SWE-Bench Verified.
  • The framework scales scoring granularity, allows for repeated verification, and decomposes evaluation criteria for detailed feedback.

Maintenance & Community

No information regarding maintenance, community channels, or notable contributors is present in the provided README snippet.

Licensing & Compatibility

No license information is provided in the README snippet.

Limitations & Caveats

The framework requires access to Google's Vertex AI for logprob extraction, necessitating an API key. The evaluation relies on pre-downloaded trajectory datasets for specific benchmarks.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 88 stars in the last 30 days
