GPU-Benchmarks-on-LLM-Inference by XiongjieDai

GPU benchmark for LLM inference using llama.cpp

created 2 years ago
1,713 stars

Top 25.4% on sourcepulse

View on GitHub
Project Summary

This repository provides comprehensive benchmarks for Large Language Model (LLM) inference speed across various NVIDIA GPUs and Apple Silicon hardware, using llama.cpp. It targets engineers and researchers evaluating hardware for LLM deployment, offering data-driven insights to optimize performance and cost.

How It Works

The project uses llama.cpp to measure inference speed for LLaMA models (specifically LLaMA 3) on diverse hardware configurations. Benchmarks cover both text generation (TG) and prompt processing (PP) throughput in tokens/second for different model sizes (8B, 70B) and quantization levels (e.g., Q4_K_M, F16). Results are presented in detailed tables comparing NVIDIA gaming and professional GPUs as well as Apple's M1, M2, and M3 series chips.
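For concreteness, a minimal llama-bench invocation along these lines (the summary does not show the author's exact commands, so the model path and layer count below are illustrative) would be:

  # pp512/tg128 are llama-bench's default prompt-processing and generation tests;
  # -ngl 99 offloads all model layers to the GPU
  ./llama-bench -m models/llama-3-8b-instruct.Q4_K_M.gguf -p 512 -n 128 -ngl 99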

Quick Start & Requirements

  • Build for NVIDIA GPUs: make clean && LLAMA_CUBLAS=1 make -j
  • Build for Apple Silicon: make clean && make -j
  • Prerequisites: llama.cpp build tools (make and a C/C++ toolchain), CUDA Toolkit for NVIDIA builds, and optionally Python for downloading or converting models.
  • Models: requires LLaMA model weights in GGUF format (e.g., downloaded from Hugging Face); an end-to-end sketch follows this list.
  • Resources: Benchmarking requires significant GPU VRAM, especially for larger models and higher precision.
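Assuming a prebuilt GGUF quantization from Hugging Face (the repository and file names below are illustrative, not taken from this project), an end-to-end run on an NVIDIA machine might look like:

  # Fetch a quantized model, build with CUDA, and benchmark it
  pip install -U "huggingface_hub[cli]"
  huggingface-cli download QuantFactory/Meta-Llama-3-8B-Instruct-GGUF \
      Meta-Llama-3-8B-Instruct.Q4_K_M.gguf --local-dir models
  make clean && LLAMA_CUBLAS=1 make -j    # NVIDIA build, as above
  ./llama-bench -m models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf -ngl 99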

Highlighted Details

  • Extensive benchmarks for LLaMA 3 on NVIDIA gaming (30-series, 40-series) and professional (RTX Ada, A100, H100) GPUs.
  • Comparative analysis of Apple Silicon (M1, M2, M3) performance against NVIDIA hardware.
  • Detailed VRAM requirements and perplexity tables for various quantization methods (a rough sizing check follows this list).
  • Performance breakdown for text generation and prompt processing.
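As a back-of-envelope cross-check on such VRAM tables (my own arithmetic, not figures taken from the repository): Q4_K_M averages roughly 4.5 bits per weight, so

  8B  params × 4.5 bits ÷ 8 bits/byte ≈  4.5 GB of weights
  70B params × 4.5 bits ÷ 8 bits/byte ≈ 39.4 GB of weights

plus KV cache and runtime overhead, which is why 70B models routinely run out of memory on 24 GB consumer cards.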

Maintenance & Community

The project is maintained by XiongjieDai. It credits ggerganov/llama.cpp for the inference engine and shawwn as a source for model weights. Users are encouraged to star the repository and contact the author with suggestions.

Licensing & Compatibility

The repository itself does not state a license. It relies on llama.cpp, which is released under the permissive MIT license; suitability for commercial use therefore depends on the licenses of llama.cpp and of the LLM weights used.

Limitations & Caveats

Benchmarks are snapshots from May 2024 and may not reflect the latest hardware or software optimizations. "OOM" (Out of Memory) is frequently reported for larger models on GPUs with insufficient VRAM, making memory capacity the primary bottleneck. Performance also varies with specific system configurations and driver versions.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 136 stars in the last 90 days

