llm-comparator by PAIR-code

Interactive tool for side-by-side LLM evaluation

Created 1 year ago
480 stars

Top 63.8% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

LLM Comparator is an interactive visualization tool and Python library for analyzing side-by-side evaluations of Large Language Models (LLMs). It enables users to qualitatively assess differences in LLM responses at both example and slice levels, aiding in the discovery of patterns and reasons for performance variations. The tool is primarily aimed at researchers and developers evaluating LLM outputs.

How It Works

The tool visualizes data from JSON files containing comparative LLM responses. Each entry includes the input prompt, outputs from two models (A and B), and a score indicating which response is preferred (e.g., from an LLM-as-a-judge system). It supports rich metadata and custom fields, allowing for detailed analysis of response characteristics, such as word count, specific stylistic elements, or categorical tags, visualized through interactive charts and tables.
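
As a rough sketch, a single entry in such a file, and the file itself, can be produced with nothing more than Python's standard library. The per-example field names (input_text, output_text_a, output_text_b, score) are the ones documented for the tool; the prompt text, the score value, and the wrapping of entries into a top-level list are illustrative assumptions to be checked against the project's schema.

    import json

    # One side-by-side comparison record, using the documented field names.
    # The prompt, responses, and score value are illustrative, not real data.
    record = {
        "input_text": "Explain the difference between a list and a tuple in Python.",
        "output_text_a": "A list is mutable, so items can be appended or removed...",
        "output_text_b": "Tuples are immutable sequences; lists can change in place...",
        # Preference score, e.g. from an LLM-as-a-judge run; the exact sign and
        # scale convention should be taken from the project's schema docs.
        "score": 0.5,
    }

    # Assumption: multiple records are collected into a list and written to the
    # JSON file that the visualization loads; verify the exact top-level layout
    # against the LLM Comparator documentation.
    with open("llm_comparator_eval.json", "w") as f:
        json.dump([record], f, indent=2)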

Quick Start & Requirements

  • Interactive Demo: https://pair-code.github.io/llm-comparator/
  • Local Development:
    git clone https://github.com/PAIR-code/llm-comparator.git
    cd llm-comparator
    npm install
    npm run build
    npm run serve
    
  • Python Library: Installable via PyPI for generating JSON evaluation files.
  • Data Format: Requires JSON files adhering to a specific schema, including input_text, output_text_a, output_text_b, and score.

Highlighted Details

  • Supports detailed analysis of LLM-as-a-judge rationales and score distributions.
  • Allows inclusion of custom fields per prompt and per model response for granular analysis.
  • Can visualize individual rater scores to account for position bias and non-determinism; a sketch of both features follows this list.
  • Provides a Python library to automate the creation of evaluation JSON files.
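
To make the custom-field and per-rater points above concrete, the sketch below extends a record with those attributes. The field names custom_fields and individual_rater_scores, and their nesting, are assumptions inferred from the feature descriptions rather than a verified schema.

    import json

    # Illustrative record with per-prompt/per-response custom fields and
    # individual rater scores. Field names here are assumptions; consult the
    # repository's documented schema before generating real files.
    record = {
        "input_text": "Write a haiku about debugging.",
        "output_text_a": "Silent breakpoint waits / the stack unwinds its secrets / dawn finds the null check",
        "output_text_b": "Bugs hide in the night / print statements light the way / tests sleep soundly now",
        "score": 0.75,  # aggregate judge preference
        "custom_fields": {
            "word_count_a": 13,             # per-response numeric field
            "word_count_b": 14,             # per-response numeric field
            "category": "creative_writing", # per-prompt categorical tag
        },
        # One entry per judge call, e.g. with A/B positions swapped, so that
        # position bias and non-determinism can be inspected in the UI.
        "individual_rater_scores": [
            {"score": 1.0, "rationale": "Response A follows the 5-7-5 structure."},
            {"score": 0.5, "rationale": "Both acceptable; A is slightly more vivid."},
        ],
    }

    print(json.dumps(record, indent=2))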

Maintenance & Community

This is a research project under active development by the PAIR team. The README does not list community engagement channels such as a discussion forum or contribution guidelines.

Licensing & Compatibility

No license is specified in the README, and the project carries the disclaimer "This is not an official Google product." The missing license has potential implications for commercial use or integration into proprietary systems.

Limitations & Caveats

The project is described as being in an early stage of development with potential bugs. The license is not specified, which may pose a barrier to commercial adoption or integration into closed-source projects.

Health Check

  • Last Commit: 7 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 11 stars in the last 30 days

Explore Similar Projects

Starred by Lewis Tunstall (Research Engineer at Hugging Face), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 2 more.

reward-bench by allenai

634 stars
Reward model evaluation tool
Created 1 year ago
Updated 3 months ago
Starred by Morgan Funtowicz (Head of ML Optimizations at Hugging Face), Luis Capelo (Cofounder of Lightning AI), and 7 more.

lighteval by huggingface

2k stars
LLM evaluation toolkit for multiple backends
Created 1 year ago
Updated 1 day ago