GPU benchmark for LLM inference using llama.cpp
This repository provides comprehensive benchmarks of Large Language Model (LLM) inference speed across various NVIDIA GPUs and Apple Silicon hardware, using llama.cpp. It targets engineers and researchers evaluating hardware for LLM deployment, offering data-driven insights to optimize performance and cost.
How It Works
The project uses llama.cpp to measure inference speeds for LLaMA models (specifically LLaMA 3) on diverse hardware configurations. Benchmarks cover both text generation (TG) and prompt processing (PP) speeds, in tokens per second, for different model sizes (8B, 70B) and quantization levels (e.g., Q4_K_M, F16). Results are presented as detailed tables comparing performance across NVIDIA gaming and professional GPUs as well as Apple's M1, M2, and M3 series chips.
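For orientation, numbers like these are typically produced with llama.cpp's bundled llama-bench tool. The command below is a sketch under that assumption; the model filename and parameter values are illustrative, not taken from the repository's published tables.

```bash
# Hypothetical benchmark run: measure prompt processing (pp) and text generation (tg)
# throughput in tokens/second for a Q4_K_M-quantized LLaMA 3 8B model.
# -p = prompt length, -n = tokens to generate, -ngl = layers offloaded to the GPU.
./llama-bench -m ./models/Meta-Llama-3-8B-Q4_K_M.gguf -p 512 -n 128 -ngl 99
```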
Quick Start & Requirements
Build llama.cpp with CUDA support (NVIDIA GPUs):

```bash
make clean && LLAMA_CUBLAS=1 make -j
```

Build without CUDA (CPU or Apple Silicon):

```bash
make clean && make -j
```
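A model must be converted to GGUF and quantized before it can be benchmarked. The following is a minimal sketch of that step, assuming a Hugging Face checkout of the 8B model and the conversion script and quantize binary shipped in llama.cpp checkouts of that period; the paths and filenames are illustrative.

```bash
# Hypothetical model preparation: convert Hugging Face weights to GGUF (F16),
# then quantize to Q4_K_M for the benchmark runs.
python convert-hf-to-gguf.py ./Meta-Llama-3-8B --outfile ./models/llama-3-8b-f16.gguf
./quantize ./models/llama-3-8b-f16.gguf ./models/llama-3-8b-Q4_K_M.gguf Q4_K_M
```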
Requirements: llama.cpp build tools, the CUDA Toolkit (for NVIDIA GPUs), and potentially Python for model access.

Highlighted Details
Maintenance & Community
The project is maintained by XiongjieDai. It credits ggerganov/llama.cpp and shawwn for the model weights. Users are encouraged to star the repository and contact the author with advice.
Licensing & Compatibility
The repository itself does not explicitly state a license. However, it relies on llama.cpp, which is released under the permissive MIT license. Suitability for commercial use therefore depends on the licenses of the LLM models used as well as llama.cpp's own license.
Limitations & Caveats
Benchmarks are snapshots from May 2024 and may not reflect the latest hardware or software optimizations. "OOM" (out of memory) is frequently reported for larger models on GPUs with insufficient VRAM; for example, the weights of a 70B model at Q4_K_M quantization alone occupy roughly 40 GB, which exceeds the VRAM of most single consumer GPUs, making memory capacity the primary bottleneck. Performance can also vary with the specific system configuration and driver version.