llmperf by ray-project

Validation and benchmarking library for LLM APIs

created 1 year ago
971 stars

Top 38.8% on sourcepulse

Project Summary

LLMPerf is a library for evaluating the performance and correctness of Large Language Model (LLM) APIs. It is designed for researchers and engineers who need to benchmark different LLM providers and models under various load conditions. The tool helps quantify inter-token latency, generation throughput, and response accuracy.

How It Works

LLMPerf utilizes Ray for distributed execution, enabling it to simulate concurrent requests to LLM APIs. It offers two primary test types: a load test measuring latency and throughput, and a correctness test verifying response accuracy against specific prompts. Token counting is standardized using LlamaTokenizer for consistent comparisons across different LLM backends.
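To make the two load-test metrics concrete, here is a minimal sketch of how inter-token latency and generation throughput can be derived from per-token arrival times. The helper name and the timestamps are illustrative, not part of llmperf's API:

```python
import statistics

def summarize_stream(token_timestamps):
    """Hypothetical helper: compute mean inter-token latency (s) and
    generation throughput (tokens/s) from per-token arrival times."""
    # Gap between each consecutive pair of tokens.
    gaps = [b - a for a, b in zip(token_timestamps, token_timestamps[1:])]
    inter_token_latency = statistics.mean(gaps)
    # Tokens generated after the first one, divided by elapsed time.
    duration = token_timestamps[-1] - token_timestamps[0]
    throughput = (len(token_timestamps) - 1) / duration
    return inter_token_latency, throughput

# Five tokens arriving over 0.2 s: ~0.05 s mean gap, ~20 tokens/s.
itl, tps = summarize_stream([0.0, 0.05, 0.10, 0.16, 0.20])
```

llmperf aggregates statistics like these across many concurrent Ray workers, which is why results are reported as distributions rather than single numbers.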

Quick Start & Requirements

  • Install: git clone https://github.com/ray-project/llmperf.git && cd llmperf && pip install -e .
  • Prerequisites: Python, Ray, Transformers (LlamaTokenizerFast). API keys and endpoint configurations are required for specific LLM providers.
  • Documentation: LLMPerf README
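As a sketch of a typical invocation against an OpenAI-compatible endpoint (the environment variables and flag names below follow the llmperf README; the endpoint URL and model are placeholders — consult the README for the full flag list):

```shell
# Point llmperf at an OpenAI-compatible endpoint.
export OPENAI_API_KEY="your-key-here"
export OPENAI_API_BASE="https://api.example.com/v1"  # placeholder endpoint

# Load test: 100 requests, 2 concurrent, measuring
# inter-token latency and throughput.
python token_benchmark_ray.py \
  --model "meta-llama/Llama-2-7b-chat-hf" \
  --llm-api openai \
  --num-concurrent-requests 2 \
  --max-num-completed-requests 100 \
  --results-dir result_outputs
```

Results are written as JSON under the chosen results directory for offline analysis.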

Highlighted Details

  • Supports benchmarking of OpenAI-compatible APIs, Anthropic, TogetherAI, Hugging Face, Vertex AI, and SageMaker endpoints.
  • Integrates with LiteLLM for broad LLM provider compatibility.
  • Load tests measure inter-token latency and throughput using Shakespearean sonnet prompts.
  • Correctness tests validate specific prompt-response patterns, like number conversion.
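In the spirit of the number-conversion correctness check described above, a response validator might accept an answer if the expected value appears in the model's reply in either plain or comma-grouped digit form. The function below is a hypothetical illustration, not llmperf's actual checker:

```python
def check_number_response(expected: int, response: str) -> bool:
    """Hypothetical validator: does the response contain the expected
    number, written as plain digits or with comma grouping?"""
    digits = str(expected)          # e.g. "1200"
    grouped = f"{expected:,}"       # e.g. "1,200"
    return digits in response or grouped in response

check_number_response(1200, "The answer is 1,200.")   # matches grouped form
check_number_response(1200, "It is twelve hundred.")  # spelled-out form fails
```

A production checker would also need to handle spelled-out numbers and surrounding punctuation; this sketch only shows the pattern-matching idea.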

Maintenance & Community

  • Developed by the Ray Project.
  • Legacy codebase available at llmperf-legacy.

Licensing & Compatibility

  • License: Apache License 2.0.
  • Compatibility: Suitable for commercial use and integration with closed-source applications.

Limitations & Caveats

Performance results are sensitive to backend implementation, network conditions, and time of day, and may not directly correlate with all user workloads. Vertex AI and SageMaker do not return token counts, necessitating tokenization via LlamaTokenizer for these services.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 94 stars in the last 90 days
