llama-throughput-lab  by alexziskind1

`llama.cpp` server throughput benchmarking harness

Created 1 month ago
273 stars

Top 94.7% on SourcePulse

View on GitHub
Project Summary

This repository provides an interactive launcher and benchmarking harness for measuring llama.cpp server throughput. It lets engineers and researchers run systematic tests, sweep parameters, and analyze the performance of llama.cpp deployments under various load conditions, informing optimization and adoption decisions.

How It Works

This project offers a dialog-based launcher (./run_llama_tests.py) to configure and execute throughput tests and parameter sweeps for the llama.cpp server. It supports single-request, concurrent, and round-robin load testing (requiring nginx), along with sweeps that explore parameter ranges like threads, concurrency, and instances. The system heavily relies on environment variables for detailed configuration of model paths, server arguments, and test parameters, allowing for deep customization of benchmarking scenarios.
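In the real harness, the round-robin mode delegates request distribution to nginx; as a minimal sketch of the underlying idea, the snippet below cycles request indices across a set of hypothetical instance ports (the port numbers are illustrative, not the project's defaults):

```python
from itertools import cycle

# Hypothetical ports for several llama-server instances; in the real
# harness, nginx performs this round-robin balancing.
INSTANCE_PORTS = [8081, 8082, 8083]

def assign_requests(n_requests, ports=INSTANCE_PORTS):
    """Assign each request index to an instance port in round-robin order."""
    rr = cycle(ports)
    return [(i, next(rr)) for i in range(n_requests)]

print(assign_requests(5))
# [(0, 8081), (1, 8082), (2, 8083), (3, 8081), (4, 8082)]
```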

Quick Start & Requirements

  • Installation: Install dialog package (e.g., sudo apt-get install dialog on Debian/Ubuntu). Clone and build llama.cpp to obtain the llama-server binary.
  • Prerequisites: A local build of llama.cpp with the llama-server binary, nginx installed (for round-robin tests/sweeps), and a GGUF model file.
  • Running: Execute ./run_llama_tests.py for the interactive launcher, or run tests/scripts directly using Python (e.g., .venv/bin/python -m unittest tests/test_llama_server_concurrent.py).
  • Configuration: Primarily via environment variables (e.g., LLAMA_MODEL_PATH, LLAMA_CPP_DIR, LLAMA_SERVER_HOST, LLAMA_CONCURRENCY).
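Since configuration is driven by environment variables, a minimal sketch of that pattern looks like the following (the variable names come from the README; the default values shown are illustrative assumptions, not the project's actual defaults):

```python
import os

def load_config(env=None):
    """Collect harness settings from environment variables,
    falling back to illustrative defaults."""
    env = os.environ if env is None else env
    return {
        "model_path": env.get("LLAMA_MODEL_PATH", "models/model.gguf"),
        "llama_cpp_dir": env.get("LLAMA_CPP_DIR", "../llama.cpp"),
        "server_host": env.get("LLAMA_SERVER_HOST", "127.0.0.1"),
        "concurrency": int(env.get("LLAMA_CONCURRENCY", "4")),
    }

config = load_config({"LLAMA_CONCURRENCY": "16"})
print(config["concurrency"])  # 16
```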

Highlighted Details

  • Supports distinct test types: single request, concurrent requests, and round-robin (requires nginx).
  • Benchmark sweeps cover threads (--threads/--threads-http), round-robin configurations (max tokens × concurrency), and full sweeps (instances × parallel × concurrency).
  • Includes an analyze-data.py script for processing sweep-results CSV files, enabling sorting by throughput, errors, and other metrics.
  • Extensive environment variable support allows fine-grained control over server arguments, model paths, and test parameters.

Maintenance & Community

No specific information regarding maintainers, community channels (e.g., Discord, Slack), or project roadmap is provided in the README.

Licensing & Compatibility

The README does not specify a software license. This absence may pose compatibility concerns for commercial use or integration into closed-source projects.

Limitations & Caveats

A pre-built llama.cpp with the llama-server binary is a mandatory prerequisite. nginx is required for round-robin tests and sweeps. Sweep scripts automatically manage certain flags (--parallel, --batch-size, --ubatch), preventing their direct use via LLAMA_SERVER_ARGS during sweeps. The lack of explicit licensing information is a significant caveat for adoption.

Health Check
Last Commit

2 weeks ago

Responsiveness

Inactive

Pull Requests (30d)
10
Issues (30d)
5
Star History
276 stars in the last 30 days
