llm-as-a-verifier
A framework for fine-grained LLM verification and trajectory reward modeling
Top 72.8% on SourcePulse
LLM-as-a-Verifier is a general-purpose framework designed to provide fine-grained feedback for evaluating LLM-generated trajectories. It addresses the limitations of single-score evaluations by incorporating scoring granularity, repeated verification, and criteria decomposition, enabling more nuanced assessment. This framework is beneficial for researchers and engineers seeking to rigorously benchmark and improve LLM agent performance on complex tasks.
How It Works
The core of LLM-as-a-Verifier lies in its reward approximation formula, which quantifies a trajectory's reward $R(t, \tau)$ by averaging probabilities assigned by a model $\theta$ across multiple evaluation criteria ($C$), repeated verifications ($K$), and granular score tokens ($G$). This approach avoids oversimplification by not reducing complex distributions to single discrete scores. Candidate trajectories are then compared pairwise in a round-robin tournament, with the trajectory achieving the most wins selected as the best.
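The description above can be made concrete with a small sketch. It assumes each verification call returns log-probabilities over a handful of score tokens, that each score token maps to a numeric value, and that a pairwise match is decided by comparing the two approximate rewards. The function names, the score-token mapping, and the tie handling are all hypothetical and chosen for illustration; the repository's own scoring prompt and pairwise procedure may differ.

```python
import math
from itertools import combinations


def expected_score(token_logprobs, score_values):
    """Expected score for one verification call.

    token_logprobs: {score_token: logprob} as returned by the verifier model.
    score_values:   {score_token: numeric value} (hypothetical mapping).
    Probabilities are renormalised over the known score tokens, so the
    full distribution is used instead of a single discrete score.
    """
    probs = {
        tok: math.exp(lp)
        for tok, lp in token_logprobs.items()
        if tok in score_values
    }
    total = sum(probs.values())
    if total == 0:
        return 0.0
    return sum(p * score_values[tok] for tok, p in probs.items()) / total


def approximate_reward(logprobs_by_criterion, score_values):
    """Average the expected score over criteria (C) and repeats (K).

    logprobs_by_criterion: {criterion: [token_logprobs for each repeat]}.
    """
    per_criterion = []
    for repeats in logprobs_by_criterion.values():
        scores = [expected_score(lp, score_values) for lp in repeats]
        per_criterion.append(sum(scores) / len(scores))
    return sum(per_criterion) / len(per_criterion)


def round_robin_winner(rewards):
    """Play every pair once; the trajectory with the most wins is selected.

    rewards: {trajectory_id: approximate reward}. Deciding a match by
    comparing rewards is an assumption of this sketch; ties award no win.
    """
    wins = {name: 0 for name in rewards}
    for a, b in combinations(rewards, 2):
        if rewards[a] > rewards[b]:
            wins[a] += 1
        elif rewards[b] > rewards[a]:
            wins[b] += 1
    best = max(wins, key=wins.get)
    return best, wins
```

In practice the `token_logprobs` dictionaries would come from the logprob output of the verifier model, with one entry per repeated call and per criterion.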
Quick Start & Requirements
Install: pip install google-genai tqdm
A .env file containing a VERTEX_API_KEY is required for logprob extraction.
Pre-downloaded trajectory data is expected in data/terminal_trajs/ and data/swebench_verified_trajs/.
The scripts run_terminal_bench.py and run_swe_bench.py are provided.
Highlighted Details
Maintenance & Community
No information regarding maintenance, community channels, or notable contributors is present in the provided README snippet.
Licensing & Compatibility
No license information is provided in the README snippet.
Limitations & Caveats
The framework requires access to Google's Vertex AI for logprob extraction, necessitating an API key. The evaluation relies on pre-downloaded trajectory datasets for specific benchmarks.