GPU benchmark for LLM inference using llama.cpp
This repository provides comprehensive benchmarks of Large Language Model (LLM) inference speed across various NVIDIA GPUs and Apple Silicon hardware, using llama.cpp. It targets engineers and researchers evaluating hardware for LLM deployment, offering data-driven insights to optimize performance and cost.
How It Works
The project uses llama.cpp to measure inference speeds for LLaMA models (specifically LLaMA 3) on diverse hardware configurations. Benchmarks cover both text generation (TG) and prompt processing (PP) speeds, in tokens per second, for different model sizes (8B, 70B) and quantization levels (e.g., Q4_K_M, F16). Results are presented as detailed tables comparing performance across NVIDIA gaming and professional GPUs as well as Apple's M1, M2, and M3 series chips.
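For orientation, numbers like these are typically produced with llama.cpp's bundled llama-bench tool. The command below is a sketch under that assumption; the model filename and parameter values are illustrative, not taken from the repository's published tables.

```bash
# Hypothetical benchmark run: measure prompt processing (pp) and text generation (tg)
# throughput in tokens/second for a Q4_K_M-quantized LLaMA 3 8B model.
# -p = prompt length, -n = tokens to generate, -ngl = layers offloaded to the GPU.
./llama-bench -m ./models/Meta-Llama-3-8B-Q4_K_M.gguf -p 512 -n 128 -ngl 99
```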
Quick Start & Requirements
Build llama.cpp with CUDA support (NVIDIA GPUs):

```bash
make clean && LLAMA_CUBLAS=1 make -j
```

Build without CUDA (CPU or Apple Silicon):

```bash
make clean && make -j
```
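A model must be converted to GGUF and quantized before it can be benchmarked. The following is a minimal sketch of that step, assuming a Hugging Face checkout of the 8B model and the conversion script and quantize binary shipped in llama.cpp checkouts of that period; the paths and filenames are illustrative.

```bash
# Hypothetical model preparation: convert Hugging Face weights to GGUF (F16),
# then quantize to Q4_K_M for the benchmark runs.
python convert-hf-to-gguf.py ./Meta-Llama-3-8B --outfile ./models/llama-3-8b-f16.gguf
./quantize ./models/llama-3-8b-f16.gguf ./models/llama-3-8b-Q4_K_M.gguf Q4_K_M
```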
Requirements: llama.cpp build tools, the CUDA Toolkit (for NVIDIA GPUs), and potentially Python for model access.

Highlighted Details
Maintenance & Community
The project is maintained by XiongjieDai. It credits ggerganov/llama.cpp and shawwn for the model weights. Users are encouraged to star the repository and contact the author with advice.
Licensing & Compatibility
The repository itself does not explicitly state a license. However, it relies on llama.cpp, which is released under the permissive MIT license. Suitability for commercial use therefore depends on the licenses of the LLM models used as well as llama.cpp's own license.
Limitations & Caveats
Benchmarks are snapshots from May 2024 and may not reflect the latest hardware or software optimizations. "OOM" (out of memory) is frequently reported for larger models on GPUs with insufficient VRAM; for example, the weights of a 70B model at Q4_K_M quantization alone occupy roughly 40 GB, which exceeds the VRAM of most single consumer GPUs, making memory capacity the primary bottleneck. Performance can also vary with the specific system configuration and driver version.