p2l by lmarena

Prompt-to-Leaderboard for LLM evaluation

Created 7 months ago
254 stars

Top 99.1% on SourcePulse

View on GitHub
Project Summary

Prompt-to-Leaderboard (P2L) addresses the limitations of aggregated LLM evaluation metrics by enabling prompt-specific leaderboards. This allows for nuanced, unsupervised, and personalized LLM evaluations, as well as optimized query routing and automated assessment of model strengths and weaknesses. The target audience includes researchers and developers working with LLMs who need more granular performance insights.

How It Works

P2L trains a model that takes a natural language prompt as input and outputs a vector of Bradley-Terry coefficients, one per candidate LLM. These coefficients are used to predict human preference votes, yielding a prompt-dependent leaderboard. This captures performance variation across prompts and users that averaged metrics hide, and the method's quality at producing prompt-specific evaluations improves with scale, similar to LLMs themselves.
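As a concrete illustration of how the Bradley-Terry coefficients turn into preferences, the sketch below ranks models for a single prompt and computes a pairwise win probability from a coefficient vector. It is a minimal Python sketch with hypothetical model names and coefficient values; the actual P2L model interface may differ.

    import numpy as np

    # Hypothetical P2L output for one prompt: one Bradley-Terry coefficient per model.
    # In practice these would come from the trained P2L model; the values here are made up.
    models = ["model-a", "model-b", "model-c"]
    bt_coeffs = np.array([1.20, 0.35, -0.10])

    def win_probability(coeff_i: float, coeff_j: float) -> float:
        # Bradley-Terry probability that model i is preferred over model j.
        return 1.0 / (1.0 + np.exp(-(coeff_i - coeff_j)))

    # Prompt-specific leaderboard: rank models by their coefficient for this prompt.
    for name, coeff in sorted(zip(models, bt_coeffs), key=lambda x: -x[1]):
        print(f"{name}: {coeff:+.2f}")

    # Predicted preference between the first two models for this prompt.
    p = win_probability(bt_coeffs[0], bt_coeffs[1])
    print(f"P({models[0]} preferred over {models[1]}) = {p:.2f}")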

Quick Start & Requirements

  • Installation: Uses uv for environment management. Install uv via curl -LsSf https://astral.sh/uv/install.sh | sh, then source $HOME/.local/bin/env. Create and activate a Python 3.10 environment with uv venv .env --python 3.10 and source .env/bin/activate.
  • Serving P2L: uv pip install -r serve_requirements.txt
  • Serving Router: uv pip install -r route/requirements.txt
  • Training: uv pip install -r train_requirements.txt
  • Prerequisites: Python 3.10 (other versions untested). GPU is supported and enabled by default for serving models, but can be disabled with --no-cuda.
  • Docs: The README provides detailed setup and usage instructions for serving, routing, and training.

Highlighted Details

  • Enables prompt-specific leaderboards for nuanced LLM evaluation.
  • Supports serving an OpenAI-compatible router for seamless integration (see the request sketch after this list).
  • Includes options for cost-optimal routing using linear programming (a small LP sketch also follows this list).
  • Provides tools for training P2L models and performing evaluations on datasets.
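To show what the OpenAI-compatible router enables, the hedged sketch below sends a chat request with the standard OpenAI Python client pointed at a locally served router. The base URL, port, API key placeholder, and model name are assumptions for illustration; the actual values depend on how the router is launched.

    from openai import OpenAI

    # Assumed local endpoint for a P2L router served in OpenAI-compatible mode.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model="p2l-router",  # hypothetical model name exposed by the router
        messages=[{"role": "user", "content": "Explain the Bradley-Terry model in one sentence."}],
    )
    print(response.choices[0].message.content)

Because the interface matches the OpenAI API, existing clients and tooling can point at the router without code changes beyond the base URL.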
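The cost-optimal routing idea can also be written as a small linear program: choose a routing distribution over models that maximizes prompt-specific expected quality subject to a cost budget. The sketch below is a hedged illustration using scipy rather than the repository's optimal-lp or simple-lp code; the quality scores, costs, and budget are hypothetical.

    import numpy as np
    from scipy.optimize import linprog

    # Hypothetical per-prompt quality scores (e.g., derived from P2L coefficients)
    # and per-call costs for three candidate models.
    quality = np.array([0.90, 0.75, 0.60])
    cost = np.array([10.0, 3.0, 1.0])
    budget = 4.0  # maximum expected cost per query

    # Maximize quality @ p  <=>  minimize -quality @ p,
    # subject to cost @ p <= budget, sum(p) == 1, and 0 <= p_i <= 1.
    result = linprog(
        c=-quality,
        A_ub=[cost],
        b_ub=[budget],
        A_eq=[np.ones(3)],
        b_eq=[1.0],
        bounds=[(0.0, 1.0)] * 3,
    )
    print("Routing probabilities:", np.round(result.x, 3))
    print("Expected quality:", round(float(quality @ result.x), 3))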

Maintenance & Community

The project is associated with LMArena and the paper "Prompt-to-Leaderboard." Further details on community or specific maintainers are not explicitly provided in the README.

Licensing & Compatibility

The README does not explicitly state the license type. Compatibility for commercial use or closed-source linking is not detailed.

Limitations & Caveats

The README notes that Python versions other than 3.10 are untested. The optimal-lp cost optimizer is compatible only with BT models, and simple-lp only with grounded RK models. As noted above, the license and terms for commercial use are not stated.

Health Check

  • Last Commit: 4 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 30 days
