llama-benchy by eugr

LLM inference benchmarking tool for OpenAI-compatible endpoints

Created 4 months ago
426 stars

Top 68.8% on SourcePulse

Project Summary

llama-benchy benchmarks Large Language Model (LLM) inference endpoints. It is aimed in particular at backends other than llama.cpp and at accurately measuring prompt processing speed at varying context lengths. It targets engineers, researchers, and power users evaluating LLM performance, and offers a standardized way to measure prompt processing and token generation speeds, along with the latency metrics TTFR, est_ppt, and e2e_ttft, across OpenAI-compatible services.

How It Works

llama-benchy sends requests to OpenAI-compatible LLM endpoints while systematically varying prompt length, generation length, and context depth. It measures Prompt Processing (pp) and Token Generation (tg) speeds, alongside Time To First Response (TTFR), Estimated Prompt Processing Time (est_ppt), and End-to-End Time To First Token (e2e_ttft). The tool uses HuggingFace tokenizers for accurate token counts and builds prompts from realistic text taken from Project Gutenberg, which makes results more representative when evaluating speculative decoding and multi-token prediction (MTP). A notable feature is its ability to benchmark prefix caching, showing how well an inference server handles repeated contexts.
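As a rough illustration of how timings like these can be derived from a streamed /v1/chat/completions response, the sketch below walks over timestamped server-sent-event lines and computes a time to first chunk, a time to first content token, and a chunk rate for the generation phase. The function name, the (timestamp, line) input format, and the formulas are assumptions made for illustration. They are not taken from llama-benchy's source, and its own definitions of TTFR, est_ppt, and e2e_ttft may differ.

```python
import json


def summarize_stream(events, request_start):
    """Hypothetical helper: derive timings for one streamed chat completion.

    events: iterable of (timestamp_seconds, raw_sse_line) pairs.
    request_start: timestamp at which the request was sent.
    """
    first_chunk_at = None   # first parseable data chunk of any kind
    first_token_at = None   # first chunk carrying non-empty content
    last_token_at = None
    content_chunks = 0

    for ts, line in events:
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel used by OpenAI-style APIs
        chunk = json.loads(payload)
        if first_chunk_at is None:
            first_chunk_at = ts
        choices = chunk.get("choices") or []
        delta = (choices[0].get("delta") or {}) if choices else {}
        if delta.get("content"):
            content_chunks += 1
            if first_token_at is None:
                first_token_at = ts
            last_token_at = ts

    result = {
        "time_to_first_chunk": (
            None if first_chunk_at is None else first_chunk_at - request_start
        ),
        "time_to_first_token": (
            None if first_token_at is None else first_token_at - request_start
        ),
        "chunks_per_second": None,
    }
    # Rate over the generation phase only; needs at least two content chunks.
    # A chunk is not guaranteed to be exactly one token, so this is only a
    # proxy for token generation speed.
    if content_chunks > 1 and last_token_at > first_token_at:
        result["chunks_per_second"] = (content_chunks - 1) / (
            last_token_at - first_token_at
        )
    return result
```

Separating the first chunk of any kind from the first chunk with content matters because some servers emit a role-only chunk before prompt processing has finished, which would otherwise make time to first token look shorter than it is.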

Quick Start & Requirements

The recommended way to install and run the tool is with uv. The simplest invocation is uvx llama-benchy --base-url <ENDPOINT_URL> --model <MODEL_NAME>. Alternatively, clone the repository and run uv pip install -e . inside a virtual environment. The key requirement is an OpenAI-compatible LLM endpoint.
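For scripted runs, the same invocation can be assembled programmatically. The sketch below only builds and prints the command line; it uses just the two flags named above, and the endpoint URL and model name are hypothetical placeholders rather than values from the README.

```python
import shlex


def build_command(base_url, model):
    """Assemble the uvx invocation from the two flags named in the text."""
    return ["uvx", "llama-benchy", "--base-url", base_url, "--model", model]


# Placeholder values (assumptions); substitute a real endpoint and model.
cmd = build_command("http://localhost:8000/v1", "my-model")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once uv is installed and the endpoint is reachable.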

Highlighted Details

  • Measures prompt processing and token generation speeds across configurable context depths.
  • Reports TTFR, est_ppt, and e2e_ttft with flexible latency measurement modes ('api', 'generation').
  • Supports concurrent requests to benchmark throughput under load.
  • Includes specific benchmarking for prefix caching effectiveness.
  • Outputs results in Markdown, JSON, or CSV formats, with options for detailed time-series data.
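To make the speed and prefix caching items above concrete, the sketch below shows one plausible way to turn raw measurements into a tokens-per-second figure and a cache speedup ratio. Both helper names and both formulas are assumptions for illustration; the README summary does not spell out how llama-benchy computes these values.

```python
def tokens_per_second(token_count, seconds):
    """Throughput for a phase that handled token_count tokens in seconds."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return token_count / seconds


def prefix_cache_speedup(cold_prefill_s, warm_prefill_s):
    """Ratio of prompt processing time without and with a cached prefix.

    A value above 1 means the request that repeated an earlier context
    was processed faster than the first (cold) request.
    """
    if cold_prefill_s <= 0 or warm_prefill_s <= 0:
        raise ValueError("timings must be positive")
    return cold_prefill_s / warm_prefill_s


# Illustrative inputs only, not measured results.
pp_speed = tokens_per_second(4096, 2.0)      # prompt tokens / prefill time
speedup = prefix_cache_speedup(2.0, 0.5)     # cold time / warm time
```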

Maintenance & Community

The project shows recent development activity, with a version dated February 6, 2026. No specific community links (e.g., Discord, Slack) or notable contributors are detailed in the provided README.

Licensing & Compatibility

The README does not state a license. Confirm the licensing terms before commercial use or closed-source integration.

Limitations & Caveats

The tool currently benchmarks only the /v1/chat/completions endpoint. The absence of a stated license is a significant caveat for adoption.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 4
  • Star History: 106 stars in the last 30 days
