llmperf-leaderboard by ray-project

LLM inference provider benchmark using LLMPerf

created 1 year ago
463 stars

Top 66.4% on sourcepulse

View on GitHub
Project Summary

This repository provides a leaderboard for Large Language Model (LLM) inference providers, benchmarking their performance and reliability. It targets developers and users seeking to understand and compare the throughput and latency of various LLM providers, aiding in deployment decisions.

How It Works

The project benchmarks LLM inference providers with the LLMPerf framework, measuring two metrics: output token throughput (tokens per second) and Time to First Token (TTFT, in seconds). Every provider is tested under the same configuration: 150 total requests, 5 concurrent requests, a mean of 550 input tokens, and a mean of 150 output tokens per request. This fixed setup gives a standardized comparison across providers.
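
To make the two metrics concrete, below is a minimal sketch of how they can be derived from per-request timing records. The field names (start, first_token, end, output_tokens) are illustrative placeholders rather than the exact schema of the leaderboard's raw data, and the leaderboard's own aggregation may differ in detail.

    from statistics import mean

    # Illustrative per-request records: wall-clock timestamps in seconds plus
    # the number of tokens generated for that request (hypothetical values).
    requests = [
        {"start": 0.00, "first_token": 0.45, "end": 3.10, "output_tokens": 148},
        {"start": 0.00, "first_token": 0.52, "end": 3.40, "output_tokens": 152},
    ]

    # Time to First Token (seconds): delay from sending the request until the
    # first streamed token arrives.
    ttft = [r["first_token"] - r["start"] for r in requests]

    # Output token throughput (tokens/second): generated tokens divided by the
    # end-to-end request time.
    throughput = [r["output_tokens"] / (r["end"] - r["start"]) for r in requests]

    print(f"mean TTFT: {mean(ttft):.2f} s")
    print(f"mean output throughput: {mean(throughput):.1f} tokens/s")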

Quick Start & Requirements

  • Install/Run: Use the command template python token_benchmark_ray.py from the LLMPerf repository, specifying model and provider details (see the example invocation after this list).
  • Prerequisites: Python and the LLMPerf framework; benchmarks were run from an AWS EC2 instance (e.g., i4i.large) in us-west-2.
  • Resources: Raw data is available in the raw_data folder.
  • Details: Benchmark configurations and results for Llama-2 7B, 13B, and 70B models are provided.
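
A representative invocation, matching the configuration described above, might look like the following. The flag names follow the LLMPerf repository's token_benchmark_ray.py but should be checked against the script's --help for the version you install; the API key, endpoint, and model name are placeholders.

    # Placeholder credentials and endpoint for the provider under test
    # (the "openai" llm-api expects an OpenAI-compatible endpoint).
    export OPENAI_API_KEY="<provider-api-key>"
    export OPENAI_API_BASE="https://<provider-endpoint>/v1"

    # 150 total requests, 5 at a time, ~550 mean input tokens, ~150 mean output tokens.
    python token_benchmark_ray.py \
      --model "<provider-model-name>" \
      --llm-api openai \
      --mean-input-tokens 550 \
      --mean-output-tokens 150 \
      --max-num-completed-requests 150 \
      --num-concurrent-requests 5 \
      --results-dir result_outputs

Per-request results land in the chosen --results-dir; the leaderboard's published runs are collected in the repository's raw_data folder.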

Highlighted Details

  • Benchmarks focus on output tokens throughput and Time to First Token (TTFT).
  • Evaluates providers like Anyscale, Bedrock, Fireworks, Groq, Lepton, Perplexity, Replicate, and Together.
  • Results are presented in detailed tables for different model sizes (7B, 13B, 70B).
  • Data was collected as of December 19, 2023.

Maintenance & Community

  • Feedback can be submitted through the project's feedback link.
  • Providers interested in being featured can open an issue or contact the maintainers by email.

Licensing & Compatibility

  • The repository's license is not explicitly stated in the README.

Limitations & Caveats

The results are subject to several sources of bias: differences in provider backend implementations, time of day, client location (which particularly affects TTFT measurements), and existing system load or provider traffic. The measurements are a proxy for performance and may not correlate with every user workload.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 8 stars in the last 90 days
