llm-comparator by PAIR-code

Interactive tool for side-by-side LLM evaluation

Created 1 year ago
480 stars

Top 63.8% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

LLM Comparator is an interactive visualization tool and Python library for analyzing side-by-side evaluations of Large Language Models (LLMs). It enables users to qualitatively assess differences in LLM responses at both example and slice levels, aiding in the discovery of patterns and reasons for performance variations. The tool is primarily aimed at researchers and developers evaluating LLM outputs.

How It Works

The tool visualizes data from JSON files containing comparative LLM responses. Each entry includes the input prompt, outputs from two models (A and B), and a score indicating which response is preferred (e.g., from an LLM-as-a-judge system). It supports rich metadata and custom fields, allowing for detailed analysis of response characteristics, such as word count, specific stylistic elements, or categorical tags, visualized through interactive charts and tables.
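
As a rough sketch, a single entry in such a file, and the file itself, can be produced with nothing more than Python's standard library. The per-example field names (input_text, output_text_a, output_text_b, score) are the ones documented for the tool; the prompt text, the score value, and the wrapping of entries into a top-level list are illustrative assumptions to be checked against the project's schema.

    import json

    # One side-by-side comparison record, using the documented field names.
    # The prompt, responses, and score value are illustrative, not real data.
    record = {
        "input_text": "Explain the difference between a list and a tuple in Python.",
        "output_text_a": "A list is mutable, so items can be appended or removed...",
        "output_text_b": "Tuples are immutable sequences; lists can change in place...",
        # Preference score, e.g. from an LLM-as-a-judge run; the exact sign and
        # scale convention should be taken from the project's schema docs.
        "score": 0.5,
    }

    # Assumption: multiple records are collected into a list and written to the
    # JSON file that the visualization loads; verify the exact top-level layout
    # against the LLM Comparator documentation.
    with open("llm_comparator_eval.json", "w") as f:
        json.dump([record], f, indent=2)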

Quick Start & Requirements

  • Interactive Demo: https://pair-code.github.io/llm-comparator/
  • Local Development:
    git clone https://github.com/PAIR-code/llm-comparator.git
    cd llm-comparator
    npm install
    npm run build
    npm run serve
    
  • Python Library: Installable via PyPI for generating JSON evaluation files.
  • Data Format: Requires JSON files adhering to a specific schema, including input_text, output_text_a, output_text_b, and score.

Highlighted Details

  • Supports detailed analysis of LLM-as-a-judge rationales and score distributions.
  • Allows inclusion of custom fields per prompt and per model response for granular analysis.
  • Can visualize individual rater scores to account for position bias and non-determinism; a sketch of both features follows this list.
  • Provides a Python library to automate the creation of evaluation JSON files.
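
To make the custom-field and per-rater points above concrete, the sketch below extends a record with those attributes. The field names custom_fields and individual_rater_scores, and their nesting, are assumptions inferred from the feature descriptions rather than a verified schema.

    import json

    # Illustrative record with per-prompt/per-response custom fields and
    # individual rater scores. Field names here are assumptions; consult the
    # repository's documented schema before generating real files.
    record = {
        "input_text": "Write a haiku about debugging.",
        "output_text_a": "Silent breakpoint waits / the stack unwinds its secrets / dawn finds the null check",
        "output_text_b": "Bugs hide in the night / print statements light the way / tests sleep soundly now",
        "score": 0.75,  # aggregate judge preference
        "custom_fields": {
            "word_count_a": 13,             # per-response numeric field
            "word_count_b": 14,             # per-response numeric field
            "category": "creative_writing", # per-prompt categorical tag
        },
        # One entry per judge call, e.g. with A/B positions swapped, so that
        # position bias and non-determinism can be inspected in the UI.
        "individual_rater_scores": [
            {"score": 1.0, "rationale": "Response A follows the 5-7-5 structure."},
            {"score": 0.5, "rationale": "Both acceptable; A is slightly more vivid."},
        ],
    }

    print(json.dumps(record, indent=2))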

Maintenance & Community

This is a research project under active development by the PAIR team. The README does not list community engagement channels such as a discussion forum or contribution guidelines.

Licensing & Compatibility

No license is specified in the README, and the project carries the disclaimer "This is not an official Google product." The missing license has potential implications for commercial use or integration into proprietary systems.

Limitations & Caveats

The project is described as being in an early stage of development with potential bugs. The license is not specified, which may pose a barrier to commercial adoption or integration into closed-source projects.

Health Check

  • Last Commit: 7 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 11 stars in the last 30 days

Explore Similar Projects

Starred by Lewis Tunstall (Research Engineer at Hugging Face), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 2 more.

reward-bench by allenai

634 stars
Reward model evaluation tool
Created 1 year ago
Updated 3 months ago
Starred by Morgan Funtowicz (Head of ML Optimizations at Hugging Face), Luis Capelo (Cofounder of Lightning AI), and 7 more.

lighteval by huggingface

2k stars
LLM evaluation toolkit for multiple backends
Created 1 year ago
Updated 1 day ago