llm-comparator  by PAIR-code

Interactive tool for side-by-side LLM evaluation

created 1 year ago
461 stars

Top 66.7% on sourcepulse

Project Summary

LLM Comparator is an interactive visualization tool and Python library for analyzing side-by-side evaluations of Large Language Models (LLMs). It enables users to qualitatively assess differences in LLM responses at both example and slice levels, aiding in the discovery of patterns and reasons for performance variations. The tool is primarily aimed at researchers and developers evaluating LLM outputs.

How It Works

The tool visualizes data from JSON files containing comparative LLM responses. Each entry includes the input prompt, outputs from two models (A and B), and a score indicating which response is preferred (e.g., from an LLM-as-a-judge system). It supports rich metadata and custom fields, allowing for detailed analysis of response characteristics, such as word count, specific stylistic elements, or categorical tags, visualized through interactive charts and tables.
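
The slice-level view described above can be approximated offline with a few lines of Python. This is a minimal sketch, not the tool's own code: the four core field names come from the data format in this summary, while the "tags" field, the sample entries, and the convention that a positive score favors model A are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical entries. "tags" and the sign convention (positive = A preferred)
# are assumptions for this sketch, not confirmed details of the tool's schema.
examples = [
    {"input_text": "Summarize the article.",
     "output_text_a": "A short summary.",
     "output_text_b": "A much longer and more detailed summary.",
     "score": 1.0, "tags": ["summarization"]},
    {"input_text": "Summarize the memo.",
     "output_text_a": "The memo covers budgets.",
     "output_text_b": "Budgets.",
     "score": -0.5, "tags": ["summarization"]},
    {"input_text": "Write a sort function.",
     "output_text_a": "def sort(xs): return sorted(xs)",
     "output_text_b": "Use a loop.",
     "score": 0.5, "tags": ["coding"]},
]


def word_count(text):
    """Count whitespace-separated tokens, a simple per-response custom field."""
    return len(text.split())


def mean_score_by_tag(entries):
    """Group scores by tag and average them, giving one number per slice."""
    by_tag = defaultdict(list)
    for entry in entries:
        for tag in entry.get("tags", []):
            by_tag[tag].append(entry["score"])
    return {tag: mean(scores) for tag, scores in by_tag.items()}


if __name__ == "__main__":
    print(mean_score_by_tag(examples))
    for entry in examples:
        print(word_count(entry["output_text_a"]), word_count(entry["output_text_b"]))
```

Per-slice averages like these are the kind of pattern the interactive charts are meant to surface; the sketch only shows the underlying arithmetic.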

Quick Start & Requirements

  • Interactive Demo: https://pair-code.github.io/llm-comparator/
  • Local Development:
    git clone https://github.com/PAIR-code/llm-comparator.git
    cd llm-comparator
    npm install
    npm run build
    npm run serve
    
  • Python Library: Installable via PyPI for generating JSON evaluation files.
  • Data Format: Requires JSON files adhering to a specific schema, including input_text, output_text_a, output_text_b, and score.
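
A file with the four fields listed above can be produced with the standard library alone. The sketch below is an assumption-laden illustration: the top-level "examples" key and the helper names `missing_fields` and `to_json_text` are hypothetical, so check the project's README for the full schema before loading a file into the tool.

```python
import json

# Field names taken from the data format described above.
REQUIRED_FIELDS = ("input_text", "output_text_a", "output_text_b", "score")


def missing_fields(entry):
    """Return the required field names that are absent from one entry."""
    return [name for name in REQUIRED_FIELDS if name not in entry]


def to_json_text(entries):
    """Validate every entry, then serialize them under an assumed 'examples' key."""
    for index, entry in enumerate(entries):
        missing = missing_fields(entry)
        if missing:
            raise ValueError(f"example {index} is missing fields: {missing}")
    # The wrapper key "examples" is an assumption for this sketch.
    return json.dumps({"examples": entries}, indent=2)


if __name__ == "__main__":
    sample = [{
        "input_text": "What is 2 + 2?",
        "output_text_a": "4",
        "output_text_b": "The answer is four.",
        "score": 0.5,
    }]
    with open("comparison.json", "w", encoding="utf-8") as handle:
        handle.write(to_json_text(sample))
```

Validating before writing keeps malformed entries from reaching the viewer, where a missing field is harder to diagnose.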

Highlighted Details

  • Supports detailed analysis of LLM-as-a-judge rationales and score distributions.
  • Allows inclusion of custom fields per prompt and per model response for granular analysis.
  • Can visualize individual rater scores to account for position bias and non-determinism.
  • Provides a Python library to automate the creation of evaluation JSON files.
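
To make the point about position bias and non-determinism concrete, here is one way individual rater scores could be combined. It is a sketch under stated assumptions, not the tool's method: the `is_flipped` and `score` keys are hypothetical, and the raw score is assumed to favor whichever response the judge saw first, so flipped ratings are negated before averaging.

```python
from statistics import mean, pstdev


def normalize(rating):
    """Express a rating from model A's perspective.

    Assumption: a positive raw score favors the response shown first, and
    'is_flipped' is True when B was shown first.
    """
    return -rating["score"] if rating["is_flipped"] else rating["score"]


def aggregate_ratings(ratings):
    """Average normalized scores and report their spread across judge runs."""
    scores = [normalize(r) for r in ratings]
    return {
        "mean": mean(scores),
        "spread": pstdev(scores),  # larger spread = less consistent judging
        "n": len(scores),
    }


if __name__ == "__main__":
    runs = [
        {"score": 1.0, "is_flipped": False},
        {"score": -1.0, "is_flipped": True},
        {"score": 0.5, "is_flipped": False},
        {"score": -0.5, "is_flipped": True},
    ]
    print(aggregate_ratings(runs))
```

Running the judge in both orders and averaging cancels a consistent preference for the first position, while the spread gives a rough sense of how much repeated runs disagree.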

Maintenance & Community

This is a research project under active development by the PAIR team. Further details and potential community engagement channels are not explicitly listed in the README.

Licensing & Compatibility

The license is not identified in this summary. The README's disclaimer, "This is not an official Google product," indicates only that Google does not officially support the project; it does not by itself set terms for commercial use or integration into proprietary systems.

Limitations & Caveats

The project is described as being in an early stage of development with potential bugs. The license is not specified, which may pose a barrier to commercial adoption or integration into closed-source projects.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1+ week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 51 stars in the last 90 days
