JudgeLM by baaivision

LLM judge for evaluating LLM-generated answers

created 1 year ago
380 stars

Top 76.1% on sourcepulse

Project Summary

JudgeLM provides a framework for fine-tuning and deploying Large Language Models (LLMs) as scalable judges for evaluating other LLMs, particularly in open-ended scenarios. It addresses the limitations of traditional benchmarks by leveraging GPT-4-generated judgments on a large dataset to train LLMs to act as sophisticated evaluators, achieving high agreement rates with expert judges.

How It Works

JudgeLM fine-tunes existing LLMs (e.g., LLaMA variants) on a dataset of prompts, LLM-generated answers, and GPT-4 judgments. To mitigate biases inherent in LLM-based evaluation, it applies three techniques: swap augmentation (against position bias), reference support (against knowledge bias), and reference dropping (against format bias). The resulting judge can assess LLM outputs efficiently across various tasks, including multimodal inputs and multi-turn conversations.
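As a rough illustration of two of these techniques, the sketch below builds swap-augmented and reference-dropped variants of a training sample. It is a minimal sketch, not the repository's implementation: the `JudgeSample` fields, the helper names, and the `drop_prob` value are all hypothetical, and a real pipeline would also need to rewrite the textual explanation inside each judgment when the answers are swapped.

```python
import random
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class JudgeSample:
    """Hypothetical training record; field names are assumptions."""
    question: str
    answer1: str
    answer2: str
    score1: float  # teacher score for answer1
    score2: float  # teacher score for answer2
    reference: Optional[str] = None  # optional reference answer


def swap_augment(sample: JudgeSample) -> JudgeSample:
    """Swap the two answers together with their scores (targets position bias)."""
    return replace(
        sample,
        answer1=sample.answer2,
        answer2=sample.answer1,
        score1=sample.score2,
        score2=sample.score1,
    )


def drop_reference(sample: JudgeSample, rng: random.Random,
                   drop_prob: float = 0.5) -> JudgeSample:
    """Randomly remove the reference so the judge sees both input formats."""
    if sample.reference is not None and rng.random() < drop_prob:
        return replace(sample, reference=None)
    return sample


def augment(samples: List[JudgeSample], seed: int = 0,
            drop_prob: float = 0.5) -> List[JudgeSample]:
    """Emit each sample in both answer orders, with references randomly dropped."""
    rng = random.Random(seed)
    out: List[JudgeSample] = []
    for s in samples:
        for variant in (s, swap_augment(s)):
            out.append(drop_reference(variant, rng, drop_prob))
    return out
```

Swapping scores along with answers keeps the label attached to the same answer text, so the model is not rewarded for preferring whichever answer appears first.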

Quick Start & Requirements

  • Install: Clone the repository, create and activate a conda environment with Python 3.10.10, then run pip3 install -e . followed by pip install flash-attn==2.0.4 --no-build-isolation.
  • Prerequisites: Requires LLaMA model weights (under LLaMA's license), Python 3.10.10, and potentially multiple A100 GPUs for training/serving. FlashAttention v2.0.4 is recommended.
  • Resources: Training JudgeLM-7B with 4x A100 (40GB) is demonstrated. Serving includes a Gradio web UI.
  • Links: OpenReview, HuggingFace Datasets, Demo.

Highlighted Details

  • Achieves >90% agreement with teacher judges, surpassing human-to-human agreement.
  • Trained on a 100K sample dataset with GPT-4 generated judgments.
  • Supports judging single answers, multimodal models, multiple answers, and multi-turn chat.
  • Offers a distributed multi-model serving system with a web UI.
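
To make the agreement figure above concrete, here is one way such a rate could be computed from pairwise scores. This is a sketch under assumptions: it treats each judgment as a pair of numeric scores and counts a match when the judge and the teacher pick the same winner (or both declare a tie). The function names are hypothetical, and the project's own evaluation scripts may define agreement differently.

```python
from typing import Sequence, Tuple

ScorePair = Tuple[float, float]


def verdict(score1: float, score2: float) -> str:
    """Reduce a pair of scores to a categorical verdict."""
    if score1 > score2:
        return "answer1"
    if score2 > score1:
        return "answer2"
    return "tie"


def agreement_rate(judge_scores: Sequence[ScorePair],
                   teacher_scores: Sequence[ScorePair]) -> float:
    """Fraction of samples where the judge and the teacher reach the same verdict."""
    if len(judge_scores) != len(teacher_scores):
        raise ValueError("judge and teacher must score the same samples")
    if not judge_scores:
        raise ValueError("at least one sample is required")
    matches = sum(
        verdict(*j) == verdict(*t)
        for j, t in zip(judge_scores, teacher_scores)
    )
    return matches / len(judge_scores)


if __name__ == "__main__":
    judge = [(8, 5), (3, 7), (6, 6), (9, 2)]
    teacher = [(9, 4), (2, 8), (7, 5), (10, 1)]
    print(agreement_rate(judge, teacher))  # 3 of 4 verdicts match: 0.75
```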

Maintenance & Community

The project is associated with BAAI and HUST. It is based on Vicuna, PandaLM, and LLM-Blender. Further community engagement details are not explicitly provided in the README.

Licensing & Compatibility

JudgeLM is based on LLaMA and must adhere to LLaMA's model license. This implies potential restrictions on commercial use or redistribution depending on the specific LLaMA license terms.

Limitations & Caveats

The project relies on LLaMA base models, inheriting any limitations or licensing restrictions associated with them. While aiming to mitigate biases, the effectiveness of these techniques in all scenarios may require further validation.
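
One simple way to probe residual position bias on your own data is to check whether a judge's verdict survives swapping the answer order. The sketch below assumes a hypothetical `judge(question, first, second)` callable that returns a score for each answer in the order given; it is not an API of the JudgeLM repository.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical judge interface: returns (score_for_first, score_for_second).
Judge = Callable[[str, str, str], Tuple[float, float]]


def winner(score_a: float, score_b: float) -> str:
    """Name the preferred answer by identity, not by position."""
    if score_a > score_b:
        return "a"
    if score_b > score_a:
        return "b"
    return "tie"


def swap_consistency(judge: Judge,
                     pairs: Iterable[Tuple[str, str, str]]) -> float:
    """Fraction of (question, a, b) items judged the same in both orders."""
    consistent = 0
    total = 0
    for question, a, b in pairs:
        fwd = judge(question, a, b)  # a shown first
        rev = judge(question, b, a)  # b shown first
        # Map the reversed scores back to answer identity before comparing.
        if winner(fwd[0], fwd[1]) == winner(rev[1], rev[0]):
            consistent += 1
        total += 1
    if total == 0:
        raise ValueError("at least one pair is required")
    return consistent / total
```

A judge that always favors the first-listed answer scores 0.0 here, while one whose scores depend only on the answers themselves scores 1.0.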

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 22 stars in the last 90 days
