PandaLM by WeOpenML

LLM evaluation benchmark for reproducible, automated assessment

created 2 years ago
921 stars

Top 40.4% on sourcepulse

Project Summary

PandaLM provides a reproducible and automated framework for evaluating Large Language Models (LLMs), particularly for organizations with confidential data or limited budgets. It addresses the cost, reproducibility, and security concerns associated with human or API-based LLM evaluations by offering an automated comparison system that includes reasoning and reference answers.

How It Works

PandaLM uses a purpose-trained judge LLM to compare responses from candidate LLMs. The evaluation model is given an instruction, an optional input context, and two candidate responses; it then determines which response is superior, justifies the choice, and can generate a reference answer. This approach aims for efficiency and consistency, reducing reliance on expensive human annotators and potentially insecure third-party APIs.
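The pairwise-comparison flow above can be sketched in a few lines. Note that the prompt template, the "1 / 2 / Tie" labels, and both helper functions below are illustrative assumptions, not PandaLM's actual code or prompt format:

```python
# Illustrative sketch of a pairwise-comparison prompt and judgment parser.
# The template and verdict labels are assumptions, not PandaLM's real format.

def build_comparison_prompt(instruction: str, context: str,
                            response_1: str, response_2: str) -> str:
    """Assemble the instruction, optional context, and two candidate
    responses into a single prompt for the judge model."""
    parts = [
        f"Instruction: {instruction}",
        f"Input: {context}" if context else "",
        f"Response 1: {response_1}",
        f"Response 2: {response_2}",
        "Which response is better? Answer 1, 2, or Tie, "
        "then give a brief reason and a reference answer.",
    ]
    return "\n\n".join(p for p in parts if p)


def parse_judgment(model_output: str) -> str:
    """Map the judge model's leading token to a verdict string."""
    stripped = model_output.strip()
    if not stripped:
        return "unparseable"
    first = stripped.split()[0].rstrip(".,:")
    return {"1": "response_1", "2": "response_2", "Tie": "tie"}.get(
        first, "unparseable")
```

For example, a judge output beginning "2. The second answer is more complete." would be mapped to a win for the second candidate.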

Quick Start & Requirements

  • Install via pip install -r requirements.txt or conda env create -f conda-env.yml.
  • Requires Python and, for local inference, a CUDA-enabled GPU (24GB VRAM recommended for the local UI).
  • Official HuggingFace model: WeOpenML/PandaLM-7B-v1.
  • Demo UI: python scripts/run-gradio.py --base_model=WeOpenML/PandaLM-7B-v1.
  • Evaluation Pipeline example: from pandalm import EvaluationPipeline.
  • More details: PandaLM GitHub Repository
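Once per-sample judgments have been collected (via the EvaluationPipeline or otherwise), they can be aggregated into pairwise win rates for a candidate pair. This is a minimal stdlib sketch assuming verdicts are simple strings; the pipeline's actual output format may differ:

```python
# Aggregate per-sample verdicts into win/tie fractions for one model pair.
# The verdict strings ("response_1", "response_2", "tie") are assumptions.
from collections import Counter


def win_rates(judgments: list[str]) -> dict[str, float]:
    """Return the fraction of samples won by each candidate, plus ties."""
    total = len(judgments)
    if total == 0:
        return {"response_1": 0.0, "response_2": 0.0, "tie": 0.0}
    counts = Counter(judgments)
    return {k: counts.get(k, 0) / total
            for k in ("response_1", "response_2", "tie")}
```

For instance, four judgments of which two favor the first candidate, one favors the second, and one is a tie yield win rates of 0.5, 0.25, and 0.25 respectively.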

Highlighted Details

  • PandaLM-7B achieves 93.75% of GPT-3.5's and 88.28% of GPT-4's evaluation ability on a human-annotated test set.
  • Includes a human-annotated test set of ~1K samples for validation and a filtered training set of 300K samples.
  • Provides code for training PandaLM, code for instruction-tuning other foundation models (Bloom, OPT, LLaMA), and model weights.
  • Accepted to ICLR 2024.

Maintenance & Community

  • Active development with recent updates (May 2024) and ICLR 2024 acceptance.
  • Contributions are welcomed via pull requests.
  • Citation details provided for academic referencing.

Licensing & Compatibility

  • Model weights follow the LLaMA license.
  • The rest of the repository is under Apache License 2.0.
  • Training-data license to be added once the dataset is uploaded.

Limitations & Caveats

The project notes that instruction-tuned models are not distributed directly due to copyright concerns, but they can be reproduced with the provided scripts. The README also indicates that more papers and features are coming soon, suggesting ongoing development.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 8 stars in the last 90 days

