LLM evaluation benchmark for reproducible, automated assessment
PandaLM provides a reproducible and automated framework for evaluating Large Language Models (LLMs), particularly for organizations with confidential data or limited budgets. It addresses the cost, reproducibility, and security concerns associated with human or API-based LLM evaluations by offering an automated comparison system that includes reasoning and reference answers.
How It Works
PandaLM utilizes a trained LLM to compare responses from candidate LLMs. The core approach involves providing the evaluation model with a context, instruction, and two responses. The PandaLM model then determines which response is superior, provides a justification, and can generate a reference answer. This method aims for efficiency and consistency, reducing reliance on expensive human annotators or potentially insecure third-party APIs.
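The interaction can be illustrated with a short script against the released evaluator weights. This is a minimal sketch assuming a standard Hugging Face transformers setup; the prompt template, generation settings, and output handling below are illustrative assumptions, not the project's canonical format (which ships inside the pandalm package).

    # Minimal sketch of one pairwise comparison with the released PandaLM weights.
    # NOTE: the prompt template here is an illustrative assumption; the canonical
    # template is defined inside the pandalm package.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "WeOpenML/PandaLM-7B-v1"
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL, torch_dtype=torch.float16, device_map="auto"
    )

    def compare(instruction: str, context: str, response_1: str, response_2: str) -> str:
        """Ask the evaluator which response is better; return its raw verdict text."""
        prompt = (
            "Below are two responses for a given task. The task is defined by the "
            "Instruction with an Input that provides further context. Evaluate the "
            "responses and generate a reference answer for the task.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            f"### Response 1:\n{response_1}\n\n"
            f"### Response 2:\n{response_2}\n\n"
            "### Evaluation:\n"
        )
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
        # Strip the prompt tokens; what remains is the verdict, the justification,
        # and the generated reference answer.
        return tokenizer.decode(
            output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )

    print(compare(
        instruction="Summarize the paragraph in one sentence.",
        context="PandaLM is a judge model trained to compare the outputs of LLMs.",
        response_1="PandaLM is a judge model that picks the better of two LLM outputs.",
        response_2="It is a model.",
    ))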
Quick Start & Requirements
Install dependencies with pip:

    pip install -r requirements.txt

or create a conda environment:

    conda env create -f conda-env.yml

The trained evaluator is published on the Hugging Face Hub as WeOpenML/PandaLM-7B-v1. A local Gradio demo can be launched against it:

    python scripts/run-gradio.py --base_model=WeOpenML/PandaLM-7B-v1

For programmatic use, the package exposes an evaluation pipeline (a usage sketch follows below):

    from pandalm import EvaluationPipeline
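A minimal end-to-end sketch of the pipeline API, assuming the constructor accepts candidate model paths and a JSON test set; the keyword arguments are illustrative and should be checked against the repository:

    from pandalm import EvaluationPipeline

    # Compare two candidate models on a shared instruction set, with PandaLM as judge.
    # The keyword arguments are illustrative assumptions, not a confirmed signature.
    pipeline = EvaluationPipeline(
        candidate_paths=["huggyllama/llama-7b", "bigscience/bloom-7b1"],
        input_data_path="data/pipeline-sanity-check.json",
    )
    print(pipeline.evaluate())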
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project does not distribute instruction-tuned candidate models directly due to copyright concerns, but they can be reproduced with the provided scripts. The README also indicates that more papers and features are coming soon, suggesting ongoing development.