llama-throughput-lab  by alexziskind1

`llama.cpp` server throughput benchmarking harness

Created 1 month ago
273 stars

Top 94.7% on SourcePulse

View on GitHub
Project Summary

This repository provides an interactive launcher and benchmarking harness for measuring llama.cpp server throughput. It lets engineers and researchers run systematic tests, sweep parameters, and analyze the performance of llama.cpp deployments under various load conditions, informing optimization and adoption decisions.

How It Works

This project offers a dialog-based launcher (./run_llama_tests.py) to configure and execute throughput tests and parameter sweeps for the llama.cpp server. It supports single-request, concurrent, and round-robin load testing (requiring nginx), along with sweeps that explore parameter ranges like threads, concurrency, and instances. The system heavily relies on environment variables for detailed configuration of model paths, server arguments, and test parameters, allowing for deep customization of benchmarking scenarios.
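In the real harness, the round-robin mode delegates request distribution to nginx; as a minimal sketch of the underlying idea, the snippet below cycles request indices across a set of hypothetical instance ports (the port numbers are illustrative, not the project's defaults):

```python
from itertools import cycle

# Hypothetical ports for several llama-server instances; in the real
# harness, nginx performs this round-robin balancing.
INSTANCE_PORTS = [8081, 8082, 8083]

def assign_requests(n_requests, ports=INSTANCE_PORTS):
    """Assign each request index to an instance port in round-robin order."""
    rr = cycle(ports)
    return [(i, next(rr)) for i in range(n_requests)]

print(assign_requests(5))
# [(0, 8081), (1, 8082), (2, 8083), (3, 8081), (4, 8082)]
```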

Quick Start & Requirements

  • Installation: Install dialog package (e.g., sudo apt-get install dialog on Debian/Ubuntu). Clone and build llama.cpp to obtain the llama-server binary.
  • Prerequisites: A local build of llama.cpp with the llama-server binary, nginx installed (for round-robin tests/sweeps), and a GGUF model file.
  • Running: Execute ./run_llama_tests.py for the interactive launcher, or run tests/scripts directly using Python (e.g., .venv/bin/python -m unittest tests/test_llama_server_concurrent.py).
  • Configuration: Primarily via environment variables (e.g., LLAMA_MODEL_PATH, LLAMA_CPP_DIR, LLAMA_SERVER_HOST, LLAMA_CONCURRENCY).
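Since configuration is driven by environment variables, a minimal sketch of that pattern looks like the following (the variable names come from the README; the default values shown are illustrative assumptions, not the project's actual defaults):

```python
import os

def load_config(env=None):
    """Collect harness settings from environment variables,
    falling back to illustrative defaults."""
    env = os.environ if env is None else env
    return {
        "model_path": env.get("LLAMA_MODEL_PATH", "models/model.gguf"),
        "llama_cpp_dir": env.get("LLAMA_CPP_DIR", "../llama.cpp"),
        "server_host": env.get("LLAMA_SERVER_HOST", "127.0.0.1"),
        "concurrency": int(env.get("LLAMA_CONCURRENCY", "4")),
    }

config = load_config({"LLAMA_CONCURRENCY": "16"})
print(config["concurrency"])  # 16
```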

Highlighted Details

  • Supports distinct test types: single request, concurrent requests, and round-robin (requires nginx).
  • Benchmark sweeps cover threads (--threads/--threads-http), round-robin configurations (max tokens × concurrency), and full sweeps (instances × parallel × concurrency).
  • Includes an analyze-data.py script for processing sweep-results CSV files, enabling sorting by throughput, errors, and other metrics.
  • Extensive environment variable support allows fine-grained control over server arguments, model paths, and test parameters.

Maintenance & Community

No specific information regarding maintainers, community channels (e.g., Discord, Slack), or project roadmap is provided in the README.

Licensing & Compatibility

The README does not specify a software license. This absence may pose compatibility concerns for commercial use or integration into closed-source projects.

Limitations & Caveats

A pre-built llama.cpp with the llama-server binary is a mandatory prerequisite. nginx is required for round-robin tests and sweeps. Sweep scripts automatically manage certain flags (--parallel, --batch-size, --ubatch), preventing their direct use via LLAMA_SERVER_ARGS during sweeps. The lack of explicit licensing information is a significant caveat for adoption.

Health Check
Last Commit

2 weeks ago

Responsiveness

Inactive

Pull Requests (30d)
10
Issues (30d)
5
Star History
276 stars in the last 30 days
