llm-as-a-verifier by llm-as-a-verifier

A framework for fine-grained LLM verification and trajectory reward modeling

Created 3 months ago

476 stars

Top 63.4% on SourcePulse

2 Experts Love This Project

Wing Lian

Founder of Axolotl AI

Victor Taelin

Author of Bend, Kind, HVM

Project Summary

LLM-as-a-Verifier is a general-purpose framework designed to provide fine-grained feedback for evaluating LLM-generated trajectories. It addresses the limitations of single-score evaluations by incorporating scoring granularity, repeated verification, and criteria decomposition, enabling more nuanced assessment. This framework is beneficial for researchers and engineers seeking to rigorously benchmark and improve LLM agent performance on complex tasks.

How It Works

The core of LLM-as-a-Verifier lies in its reward approximation formula, which quantifies a trajectory's reward $R(t, \tau)$ by averaging probabilities assigned by a model $\theta$ across multiple evaluation criteria ($C$), repeated verifications ($K$), and granular score tokens ($G$). This approach avoids oversimplification by not reducing complex distributions to single discrete scores. Candidate trajectories are then compared pairwise in a round-robin tournament, with the trajectory achieving the most wins selected as the best.
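The two steps above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the summary does not give the exact formula, so the sketch assumes each score token carries a numeric value and that the reward is the expected value under the model's token probabilities, averaged over the $C$ criteria and $K$ repetitions. The names approx_reward, pick_best, and score_pair, the nested-list layout of the probabilities, and the tie handling are all hypothetical.

```python
from itertools import combinations
from typing import Callable, Sequence, Tuple

# probs[c][k][g]: probability the verifier model assigns to score token g
# for criterion c on repetition k (hypothetical layout, assumed here).
Probs = Sequence[Sequence[Sequence[float]]]


def approx_reward(probs: Probs, values: Sequence[float]) -> float:
    """Average the expected score over C criteria and K repetitions.

    Assumption: each of the G score tokens maps to a numeric value, so a
    token distribution is kept as an expectation instead of being reduced
    to a single discrete score.
    """
    total, count = 0.0, 0
    for per_criterion in probs:          # loop over criteria (C)
        for dist in per_criterion:       # loop over repetitions (K)
            total += sum(p * v for p, v in zip(dist, values))  # sum over G
            count += 1
    return total / count


def pick_best(
    names: Sequence[str],
    score_pair: Callable[[str, str], Tuple[float, float]],
) -> str:
    """Round-robin tournament: compare every pair, return the most wins.

    score_pair(a, b) is a hypothetical callback that returns the rewards
    of a and b when judged together. A tied pair awards no win, and a tie
    in total wins goes to the candidate listed first.
    """
    wins = {name: 0 for name in names}
    for a, b in combinations(names, 2):
        reward_a, reward_b = score_pair(a, b)
        if reward_a > reward_b:
            wins[a] += 1
        elif reward_b > reward_a:
            wins[b] += 1
    return max(names, key=lambda name: wins[name])
```

In a real run, score_pair would call the verifier model and feed its token log-probabilities into approx_reward. Here it is left as a callback so the tournament logic can be tried with any scoring function.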

Quick Start & Requirements

Installation: pip install google-genai tqdm
Prerequisites: A .env file containing a VERTEX_API_KEY is required for logprob extraction.
Data: Pre-collected trajectory data for Terminal-Bench 2.0 and SWE-bench Verified must be placed in data/terminal_trajs/ and data/swebench_verified_trajs/ respectively.
Scripts: The evaluation scripts run_terminal_bench.py and run_swe_bench.py are provided, one per benchmark.
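Before running either script, it can help to confirm that the prerequisites listed above are in place. The following is a sketch of such a check using only the standard library. The helper names read_env and missing_prerequisites are hypothetical and not part of the project, and the sketch assumes the .env file uses plain KEY=VALUE lines.

```python
from pathlib import Path


def read_env(path):
    """Parse KEY=VALUE lines from a .env file, skipping blanks and comments."""
    env = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def missing_prerequisites(root="."):
    """Return the prerequisites from the list above that are not present."""
    root = Path(root)
    problems = []
    env_file = root / ".env"
    # The key name VERTEX_API_KEY comes from the prerequisites list.
    if not env_file.is_file() or not read_env(env_file).get("VERTEX_API_KEY"):
        problems.append(".env with VERTEX_API_KEY")
    # Directory names come from the data requirement in the list.
    for data_dir in ("data/terminal_trajs", "data/swebench_verified_trajs"):
        if not (root / data_dir).is_dir():
            problems.append(data_dir)
    return problems
```

Calling missing_prerequisites() from the repository root returns an empty list when the API key and both data directories are present. It does not check that the key is valid or that the trajectory data is complete.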

Highlighted Details

Reports state-of-the-art results when used as a trajectory reward model: 86.4% on Terminal-Bench 2.0 and 77.8% on SWE-bench Verified.
Improves on the Pass@1 baseline on both benchmarks: 86.4% vs 81.8% on Terminal-Bench 2.0, and 77.8% vs 76.1% on SWE-bench Verified.
The framework scales scoring granularity, allows for repeated verification, and decomposes evaluation criteria for detailed feedback.

Maintenance & Community

No information regarding maintenance, community channels, or notable contributors is present in the provided README snippet.

Licensing & Compatibility

No license information is provided in the README snippet.

Limitations & Caveats

The framework requires access to Google's Vertex AI for logprob extraction, necessitating an API key. The evaluation relies on pre-downloaded trajectory datasets for specific benchmarks.

Health Check

Last Commit

4 days ago

Responsiveness

Inactive

Star History

75 stars in the last 30 days