LLM judge for evaluating LLM-generated answers
JudgeLM provides a framework for fine-tuning and deploying Large Language Models (LLMs) as scalable judges for evaluating other LLMs, particularly in open-ended scenarios. It addresses the limitations of traditional benchmarks by leveraging GPT-4-generated judgments on a large dataset to train LLMs to act as sophisticated evaluators, achieving high agreement rates with expert judges.
How It Works
JudgeLM fine-tunes existing LLMs (e.g., LLaMA variants) on a dataset of prompts, LLM-generated answers, and GPT-4 judgments. It employs techniques like swap augmentation, reference support, and reference dropping to mitigate common biases (position, knowledge, format) inherent in LLM-based evaluation. This approach allows for efficient and effective assessment of LLM outputs across various tasks, including multimodal inputs and multi-turn conversations.
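To make the bias-mitigation idea concrete, below is a minimal sketch of swap augmentation: randomly exchanging the two candidate answers (and their scores) in a training sample so the judge cannot learn to prefer a fixed position. The dict field names are illustrative assumptions, not JudgeLM's actual data schema.

```python
# Illustrative sketch of swap augmentation for judge training data.
# Field names (question, answer_a, answer_b, score_a, score_b) are
# assumptions for this example, not JudgeLM's actual schema.
import random

def swap_augment(sample: dict, p: float = 0.5) -> dict:
    """Randomly swap the two candidate answers and their scores,
    so the judge cannot exploit a fixed answer position."""
    if random.random() < p:
        sample = {
            **sample,
            "answer_a": sample["answer_b"],
            "answer_b": sample["answer_a"],
            "score_a": sample["score_b"],
            "score_b": sample["score_a"],
        }
    return sample

example = {
    "question": "Explain what a hash map is.",
    "answer_a": "A structure mapping keys to values via a hash function.",
    "answer_b": "A list you search linearly.",
    "score_a": 9,
    "score_b": 3,
}
print(swap_augment(example))
```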
Quick Start & Requirements
Install from the repository root with pip3 install -e ., then install FlashAttention with pip install flash-attn==2.0.4 --no-build-isolation.
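As an illustration of how a fine-tuned judge might be queried once installed, the sketch below loads a judge checkpoint with Hugging Face transformers and asks it to grade two candidate answers. The checkpoint id and prompt wording are assumptions; consult the repository's judging scripts for the official prompt template and inference entry points.

```python
# Illustrative inference sketch, not the project's official entry point.
# The checkpoint id and prompt wording are assumptions; see the repository's
# judging scripts for the exact template JudgeLM was trained on.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "BAAI/JudgeLM-7B-v1.0"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

question = "What causes the seasons on Earth?"
answer_a = "The axial tilt of the Earth relative to its orbital plane."
answer_b = "The Earth's distance from the Sun changes a lot."

# Simple pairwise-grading prompt (illustrative, not the official template).
prompt = (
    "You are a fair judge. Score each answer from 1 to 10.\n"
    f"Question: {question}\n"
    f"Answer A: {answer_a}\n"
    f"Answer B: {answer_b}\n"
    "Scores (A then B):"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```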
Highlighted Details
Maintenance & Community
The project is associated with BAAI and HUST. It is based on Vicuna, PandaLM, and LLM-Blender. Further community engagement details are not explicitly provided in the README.
Licensing & Compatibility
JudgeLM is based on LLaMA and must adhere to LLaMA's model license. This implies potential restrictions on commercial use or redistribution depending on the specific LLaMA license terms.
Limitations & Caveats
The project relies on LLaMA base models, inheriting any limitations or licensing restrictions associated with them. While the training techniques aim to mitigate position, knowledge, and format biases, their effectiveness across all scenarios may require further validation.