llama-benchy by eugr

LLM inference benchmarking tool for OpenAI-compatible endpoints

Created 6 months ago

543 stars

Top 57.9% on SourcePulse

Project Summary

This tool benchmarks Large Language Model (LLM) inference endpoints, in particular backends other than llama.cpp, and aims to measure prompt processing speed accurately at varying context lengths. It targets engineers, researchers, and power users evaluating LLM performance, and gives them a standardized way to measure prompt processing and token generation speeds, Time To First Response (TTFR), Estimated Prompt Processing Time (est_ppt), and End-to-End Time To First Token (e2e_ttft) across OpenAI-compatible services.

How It Works

llama-benchy sends requests to OpenAI-compatible LLM endpoints while systematically varying prompt length, generation length, and context depth. It measures Prompt Processing (pp) and Token Generation (tg) speeds, along with Time To First Response (TTFR), Estimated Prompt Processing Time (est_ppt), and End-to-End Time To First Token (e2e_ttft). The tool uses HuggingFace tokenizers for accurate token counts and builds prompts from realistic Project Gutenberg text, which helps when evaluating speculative decoding and MTP (multi-token prediction). It can also benchmark prefix caching, showing how well an inference server handles repeated contexts.
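To make the pp and tg figures concrete, here is a minimal sketch of how such speeds can be derived from request timestamps. It is an illustration under stated assumptions, not llama-benchy's actual code: the `RequestTiming` class, the `throughput` function, and the formulas themselves (prompt tokens divided by time to first token, and token intervals divided by decode time) are hypothetical, and the tool's own est_ppt and TTFR calculations may differ.

```python
from dataclasses import dataclass
import math


@dataclass
class RequestTiming:
    """Timestamps (in seconds) for one streamed request. Hypothetical structure."""
    prompt_tokens: int
    generated_tokens: int
    t_request_sent: float
    t_first_token: float
    t_last_token: float


def throughput(timing: RequestTiming) -> dict:
    """Derive simple speed figures from one request's timestamps.

    Assumption: the whole time before the first token is attributed to
    prompt processing, and the rest to token generation.
    """
    ttft = timing.t_first_token - timing.t_request_sent
    decode_time = timing.t_last_token - timing.t_first_token

    # Prompt processing speed: prompt tokens per second until the first token.
    pp = timing.prompt_tokens / ttft if ttft > 0 else math.nan

    # The first token marks the start of the decode window, so N generated
    # tokens span N - 1 inter-token intervals.
    if decode_time > 0 and timing.generated_tokens > 1:
        tg = (timing.generated_tokens - 1) / decode_time
    else:
        tg = math.nan

    return {"ttft_s": ttft, "pp_tps": pp, "tg_tps": tg}


example = RequestTiming(
    prompt_tokens=2048,
    generated_tokens=33,
    t_request_sent=0.0,
    t_first_token=2.0,
    t_last_token=4.0,
)
result = throughput(example)
print(result)
```

Attributing all pre-first-token time to prompt processing folds network and queueing latency into the pp figure, which is presumably why the tool reports an estimated prompt processing time separately.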

Quick Start & Requirements

Installing with uv is recommended. The simplest way to run the tool is:

    uvx llama-benchy --base-url <ENDPOINT_URL> --model <MODEL_NAME>

Alternatively, clone the repository and, inside a virtual environment, run:

    uv pip install -e .

The key requirement is an OpenAI-compatible LLM endpoint.
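For scripted runs across several models or endpoints, the invocation above can be assembled programmatically. The following is a small sketch that uses only the two flags shown in this section; the function name `build_benchy_command`, the URL, and the model name are placeholders chosen for illustration.

```python
import shlex


def build_benchy_command(base_url: str, model: str) -> list[str]:
    """Return the argv list for the uvx invocation shown above.

    Only --base-url and --model are used, since those are the only
    flags this section documents.
    """
    return ["uvx", "llama-benchy", "--base-url", base_url, "--model", model]


# Placeholder endpoint and model name; substitute your own.
cmd = build_benchy_command("http://localhost:8000/v1", "my-model")
command_line = shlex.join(cmd)
print(command_line)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to launch the benchmark, provided uv is installed and the endpoint is reachable.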

Repository: https://github.com/eugr/llama-benchy
uv Installation: https://docs.astral.sh/uv/getting-started/installation/

Highlighted Details

Measures prompt processing and token generation speeds across configurable context depths.
Reports TTFR, est_ppt, and e2e_ttft with flexible latency measurement modes ('api', 'generation').
Supports concurrent requests to benchmark throughput under load.
Includes specific benchmarking for prefix caching effectiveness.
Outputs results in Markdown, JSON, or CSV formats, with options for detailed time-series data.
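As an illustration of what a prefix caching measurement can tell you, the sketch below compares the time to first token for a cold request with that of a repeated request sharing the same context. This is a hypothetical helper, not part of llama-benchy: the function name, the input values, and the choice of time to first token as the comparison metric are all assumptions.

```python
def prefix_cache_gain(cold_ttft_s: float, warm_ttft_s: float) -> dict:
    """Summarize how much a repeated context benefits from prefix caching.

    cold_ttft_s: time to first token when the context is seen for the first time.
    warm_ttft_s: time to first token when the same context prefix is sent again.
    """
    if cold_ttft_s <= 0 or warm_ttft_s <= 0:
        raise ValueError("timings must be positive")

    return {
        # How many times faster the warm request reached its first token.
        "speedup": cold_ttft_s / warm_ttft_s,
        # Fraction of the cold time that was avoided on the warm request.
        "time_saved_fraction": 1.0 - warm_ttft_s / cold_ttft_s,
    }


# Made-up timings for illustration only.
gain = prefix_cache_gain(cold_ttft_s=4.0, warm_ttft_s=0.5)
print(gain)
```

A speedup close to 1 suggests the server reprocessed the whole prompt, while a large speedup suggests the shared prefix was served from cache.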

Maintenance & Community

The project shows recent development activity, with a version dated February 6, 2026. No specific community links (e.g., Discord, Slack) or notable contributors are detailed in the provided README.

Licensing & Compatibility

The license type is not explicitly stated in the provided README. Check the repository for license terms before commercial use or closed-source integration.
