LLM evaluation benchmark for reproducible, automated assessment
PandaLM provides a reproducible and automated framework for evaluating Large Language Models (LLMs), particularly for organizations with confidential data or limited budgets. It addresses the cost, reproducibility, and security concerns associated with human or API-based LLM evaluations by offering an automated comparison system that includes reasoning and reference answers.
How It Works
PandaLM utilizes a trained LLM to compare responses from candidate LLMs. The core approach involves providing the evaluation model with a context, instruction, and two responses. The PandaLM model then determines which response is superior, provides a justification, and can generate a reference answer. This method aims for efficiency and consistency, reducing reliance on expensive human annotators or potentially insecure third-party APIs.
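The interaction can be illustrated with a short script against the released evaluator weights. This is a minimal sketch assuming a standard Hugging Face transformers setup; the prompt template, generation settings, and output handling below are illustrative assumptions, not the project's canonical format (which ships inside the pandalm package).

    # Minimal sketch of one pairwise comparison with the released PandaLM weights.
    # NOTE: the prompt template here is an illustrative assumption; the canonical
    # template is defined inside the pandalm package.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "WeOpenML/PandaLM-7B-v1"
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL, torch_dtype=torch.float16, device_map="auto"
    )

    def compare(instruction: str, context: str, response_1: str, response_2: str) -> str:
        """Ask the evaluator which response is better; return its raw verdict text."""
        prompt = (
            "Below are two responses for a given task. The task is defined by the "
            "Instruction with an Input that provides further context. Evaluate the "
            "responses and generate a reference answer for the task.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            f"### Response 1:\n{response_1}\n\n"
            f"### Response 2:\n{response_2}\n\n"
            "### Evaluation:\n"
        )
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
        # Strip the prompt tokens; what remains is the verdict, the justification,
        # and the generated reference answer.
        return tokenizer.decode(
            output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )

    print(compare(
        instruction="Summarize the paragraph in one sentence.",
        context="PandaLM is a judge model trained to compare the outputs of LLMs.",
        response_1="PandaLM is a judge model that picks the better of two LLM outputs.",
        response_2="It is a model.",
    ))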
Quick Start & Requirements
Install dependencies with pip:

    pip install -r requirements.txt

or create a conda environment:

    conda env create -f conda-env.yml

The trained evaluator is published on the Hugging Face Hub as WeOpenML/PandaLM-7B-v1. A local Gradio demo can be launched against it:

    python scripts/run-gradio.py --base_model=WeOpenML/PandaLM-7B-v1

For programmatic use, the package exposes an evaluation pipeline (a usage sketch follows below):

    from pandalm import EvaluationPipeline
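A minimal end-to-end sketch of the pipeline API, assuming the constructor accepts candidate model paths and a JSON test set; the keyword arguments are illustrative and should be checked against the repository:

    from pandalm import EvaluationPipeline

    # Compare two candidate models on a shared instruction set, with PandaLM as judge.
    # The keyword arguments are illustrative assumptions, not a confirmed signature.
    pipeline = EvaluationPipeline(
        candidate_paths=["huggyllama/llama-7b", "bigscience/bloom-7b1"],
        input_data_path="data/pipeline-sanity-check.json",
    )
    print(pipeline.evaluate())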
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project does not distribute instruction-tuned candidate models directly due to copyright concerns, but they can be reproduced with the provided scripts. The README also indicates that more papers and features are coming soon, suggesting ongoing development.