LLM judge for evaluating LLM-generated answers
JudgeLM provides a framework for fine-tuning and deploying Large Language Models (LLMs) as scalable judges for evaluating other LLMs, particularly in open-ended scenarios. It addresses the limitations of traditional benchmarks by leveraging GPT-4-generated judgments on a large dataset to train LLMs to act as sophisticated evaluators, achieving high agreement rates with expert judges.
How It Works
JudgeLM fine-tunes existing LLMs (e.g., LLaMA variants) on a dataset of prompts, LLM-generated answers, and GPT-4 judgments. It employs techniques like swap augmentation, reference support, and reference dropping to mitigate common biases (position, knowledge, format) inherent in LLM-based evaluation. This approach allows for efficient and effective assessment of LLM outputs across various tasks, including multimodal inputs and multi-turn conversations.
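To make the bias-mitigation idea concrete, below is a minimal sketch of swap augmentation: randomly exchanging the two candidate answers (and their scores) in a training sample so the judge cannot learn to prefer a fixed position. The dict field names are illustrative assumptions, not JudgeLM's actual data schema.

```python
# Illustrative sketch of swap augmentation for judge training data.
# Field names (question, answer_a, answer_b, score_a, score_b) are
# assumptions for this example, not JudgeLM's actual schema.
import random

def swap_augment(sample: dict, p: float = 0.5) -> dict:
    """Randomly swap the two candidate answers and their scores,
    so the judge cannot exploit a fixed answer position."""
    if random.random() < p:
        sample = {
            **sample,
            "answer_a": sample["answer_b"],
            "answer_b": sample["answer_a"],
            "score_a": sample["score_b"],
            "score_b": sample["score_a"],
        }
    return sample

example = {
    "question": "Explain what a hash map is.",
    "answer_a": "A structure mapping keys to values via a hash function.",
    "answer_b": "A list you search linearly.",
    "score_a": 9,
    "score_b": 3,
}
print(swap_augment(example))
```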
Quick Start & Requirements
Install from the repository root with pip3 install -e ., then install FlashAttention with pip install flash-attn==2.0.4 --no-build-isolation.
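As an illustration of how a fine-tuned judge might be queried once installed, the sketch below loads a judge checkpoint with Hugging Face transformers and asks it to grade two candidate answers. The checkpoint id and prompt wording are assumptions; consult the repository's judging scripts for the official prompt template and inference entry points.

```python
# Illustrative inference sketch, not the project's official entry point.
# The checkpoint id and prompt wording are assumptions; see the repository's
# judging scripts for the exact template JudgeLM was trained on.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "BAAI/JudgeLM-7B-v1.0"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

question = "What causes the seasons on Earth?"
answer_a = "The axial tilt of the Earth relative to its orbital plane."
answer_b = "The Earth's distance from the Sun changes a lot."

# Simple pairwise-grading prompt (illustrative, not the official template).
prompt = (
    "You are a fair judge. Score each answer from 1 to 10.\n"
    f"Question: {question}\n"
    f"Answer A: {answer_a}\n"
    f"Answer B: {answer_b}\n"
    "Scores (A then B):"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```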
Highlighted Details
Maintenance & Community
The project is associated with BAAI and HUST. It is based on Vicuna, PandaLM, and LLM-Blender. Further community engagement details are not explicitly provided in the README.
Licensing & Compatibility
JudgeLM is based on LLaMA and must adhere to LLaMA's model license. This implies potential restrictions on commercial use or redistribution depending on the specific LLaMA license terms.
Limitations & Caveats
The project relies on LLaMA base models, inheriting any limitations or licensing restrictions associated with them. While the training techniques aim to mitigate position, knowledge, and format biases, their effectiveness across all scenarios may require further validation.