p2l by lmarena

Prompt-to-Leaderboard for LLM evaluation

Created 7 months ago
254 stars

Top 99.1% on SourcePulse

View on GitHub
Project Summary

Prompt-to-Leaderboard (P2L) addresses the limitations of aggregated LLM evaluation metrics by enabling prompt-specific leaderboards. This allows for nuanced, unsupervised, and personalized LLM evaluations, as well as optimized query routing and automated assessment of model strengths and weaknesses. The target audience includes researchers and developers working with LLMs who need more granular performance insights.

How It Works

P2L trains a model that takes a natural language prompt as input and outputs a vector of Bradley-Terry coefficients, one per candidate LLM. These coefficients are used to predict human preference votes, yielding a prompt-dependent leaderboard. This captures performance variation across prompts and users that averaged metrics hide, and the method's quality at producing prompt-specific evaluations improves with scale, similar to LLMs themselves.
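As a concrete illustration of how the Bradley-Terry coefficients turn into preferences, the sketch below ranks models for a single prompt and computes a pairwise win probability from a coefficient vector. It is a minimal Python sketch with hypothetical model names and coefficient values; the actual P2L model interface may differ.

    import numpy as np

    # Hypothetical P2L output for one prompt: one Bradley-Terry coefficient per model.
    # In practice these would come from the trained P2L model; the values here are made up.
    models = ["model-a", "model-b", "model-c"]
    bt_coeffs = np.array([1.20, 0.35, -0.10])

    def win_probability(coeff_i: float, coeff_j: float) -> float:
        # Bradley-Terry probability that model i is preferred over model j.
        return 1.0 / (1.0 + np.exp(-(coeff_i - coeff_j)))

    # Prompt-specific leaderboard: rank models by their coefficient for this prompt.
    for name, coeff in sorted(zip(models, bt_coeffs), key=lambda x: -x[1]):
        print(f"{name}: {coeff:+.2f}")

    # Predicted preference between the first two models for this prompt.
    p = win_probability(bt_coeffs[0], bt_coeffs[1])
    print(f"P({models[0]} preferred over {models[1]}) = {p:.2f}")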

Quick Start & Requirements

  • Installation: Uses uv for environment management. Install uv via curl -LsSf https://astral.sh/uv/install.sh | sh, then source $HOME/.local/bin/env. Create and activate a Python 3.10 environment with uv venv .env --python 3.10 and source .env/bin/activate.
  • Serving P2L: uv pip install -r serve_requirements.txt
  • Serving Router: uv pip install -r route/requirements.txt
  • Training: uv pip install -r train_requirements.txt
  • Prerequisites: Python 3.10 (other versions untested). GPU is supported and enabled by default for serving models, but can be disabled with --no-cuda.
  • Docs: The README provides detailed setup and usage instructions for serving, routing, and training.

Highlighted Details

  • Enables prompt-specific leaderboards for nuanced LLM evaluation.
  • Supports serving an OpenAI-compatible router for seamless integration (see the request sketch after this list).
  • Includes options for cost-optimal routing using linear programming (a small LP sketch also follows this list).
  • Provides tools for training P2L models and performing evaluations on datasets.
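To show what the OpenAI-compatible router enables, the hedged sketch below sends a chat request with the standard OpenAI Python client pointed at a locally served router. The base URL, port, API key placeholder, and model name are assumptions for illustration; the actual values depend on how the router is launched.

    from openai import OpenAI

    # Assumed local endpoint for a P2L router served in OpenAI-compatible mode.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model="p2l-router",  # hypothetical model name exposed by the router
        messages=[{"role": "user", "content": "Explain the Bradley-Terry model in one sentence."}],
    )
    print(response.choices[0].message.content)

Because the interface matches the OpenAI API, existing clients and tooling can point at the router without code changes beyond the base URL.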
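The cost-optimal routing idea can also be written as a small linear program: choose a routing distribution over models that maximizes prompt-specific expected quality subject to a cost budget. The sketch below is a hedged illustration using scipy rather than the repository's optimal-lp or simple-lp code; the quality scores, costs, and budget are hypothetical.

    import numpy as np
    from scipy.optimize import linprog

    # Hypothetical per-prompt quality scores (e.g., derived from P2L coefficients)
    # and per-call costs for three candidate models.
    quality = np.array([0.90, 0.75, 0.60])
    cost = np.array([10.0, 3.0, 1.0])
    budget = 4.0  # maximum expected cost per query

    # Maximize quality @ p  <=>  minimize -quality @ p,
    # subject to cost @ p <= budget, sum(p) == 1, and 0 <= p_i <= 1.
    result = linprog(
        c=-quality,
        A_ub=[cost],
        b_ub=[budget],
        A_eq=[np.ones(3)],
        b_eq=[1.0],
        bounds=[(0.0, 1.0)] * 3,
    )
    print("Routing probabilities:", np.round(result.x, 3))
    print("Expected quality:", round(float(quality @ result.x), 3))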

Maintenance & Community

The project is associated with LMArena and the paper "Prompt-to-Leaderboard." Further details on community or specific maintainers are not explicitly provided in the README.

Licensing & Compatibility

The README does not explicitly state the license type. Compatibility for commercial use or closed-source linking is not detailed.

Limitations & Caveats

The README notes that Python versions other than 3.10 are untested. The optimal-lp cost optimizer is compatible only with BT models, and simple-lp only with grounded RK models. As noted above, the license and terms for commercial use are not stated.

Health Check

  • Last Commit: 4 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 30 days
