gpu_poor by RahulSChand

CLI tool for LLM memory and throughput estimation

created 1 year ago
1,335 stars

Top 30.7% on sourcepulse

Project Summary

This project provides a tool to estimate the GPU memory requirements and token/s throughput of any Large Language Model (LLM), targeting users who want to check whether a given model can run on their hardware. It reports whether a model fits in memory, which quantization strategies help, and whether fine-tuning is feasible.

How It Works

The tool calculates memory usage by summing model size, KV cache, activation memory, gradient/optimizer memory, and CUDA overhead. It estimates token/s based on these memory constraints and compute capabilities. The approach accounts for various quantization methods (GGML, bitsandbytes, QLoRA) and inference frameworks (vLLM, llama.cpp, Hugging Face), providing a breakdown of memory allocation for training and inference scenarios.
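The summation described above can be sketched as follows. This is a rough illustration of the approach, not the project's exact formulas: the byte widths, the activation estimate, and the CUDA overhead constant are all assumptions.

```python
def estimate_gpu_memory_gb(
    n_params_b: float,       # model parameters, in billions
    bytes_per_param: float,  # 2 for fp16, ~0.5 for 4-bit quantization
    n_layers: int,
    hidden_size: int,
    context_len: int,
    batch_size: int = 1,
    training: bool = False,
) -> float:
    """Rough GPU memory estimate: model size + KV cache + activations
    (+ gradients/optimizer state when training) + CUDA overhead."""
    GB = 1024 ** 3
    model = n_params_b * 1e9 * bytes_per_param
    # KV cache: 2 tensors (K and V) per layer, fp16 (2 bytes) per element
    kv_cache = 2 * n_layers * hidden_size * context_len * batch_size * 2
    # Activations: crude per-token, per-layer estimate (assumption)
    activations = batch_size * context_len * hidden_size * n_layers * 2
    # Gradients plus Adam optimizer state roughly triple the weight memory
    grads_opt = 3 * model if training else 0
    cuda_overhead = 0.75 * GB  # typical CUDA context cost (assumption)
    return (model + kv_cache + activations + grads_opt + cuda_overhead) / GB

# Example: a 7B model at 4-bit, 32 layers, hidden size 4096, 2048-token context
print(round(estimate_gpu_memory_gb(7, 0.5, 32, 4096, 2048), 1))  # → 5.5
```

The same skeleton explains why the tool can report a per-category breakdown: each term in the sum (model, KV cache, activations, gradients/optimizer, overhead) is computed separately before being totaled.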

Highlighted Details

  • Supports GGML, bitsandbytes, and QLoRA quantization.
  • Estimates token/s, prompt processing time, and fine-tuning iteration time.
  • Breaks down GPU memory usage into categories like KV Cache, Model Size, and Activations.
  • Aims for accuracy within 500MB for memory estimations.
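For the token/s estimate, a common first-order approximation (which this sketch uses; the project's actual model may be more detailed) is that autoregressive decoding is memory-bandwidth bound: every generated token must stream all model weights from GPU memory once.

```python
def estimate_tokens_per_sec(model_size_gb: float, mem_bandwidth_gbps: float) -> float:
    """Bandwidth-bound decoding estimate: token/s ≈ bandwidth / model size.
    Ignores KV-cache reads and compute time, so it is an upper bound."""
    return mem_bandwidth_gbps / model_size_gb

# Example: a 7B model at 4-bit (~3.5 GB) on a GPU with ~936 GB/s bandwidth
print(round(estimate_tokens_per_sec(3.5, 936)))  # → 267
```

Real throughput lands below this ceiling; the gap grows with long contexts, since KV-cache reads add to the bytes moved per token.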

Maintenance & Community

  • Project is actively maintained by RahulSChand.
  • Issue tracking available on GitHub.

Licensing & Compatibility

  • License: MIT.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

The accuracy of estimations can vary based on specific model configurations, CUDA versions, and quantization implementations. The project is continuously being improved, with features like vLLM token/s support and AWQ quantization still pending.

Health Check

  • Last commit: 8 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 45 stars in the last 90 days

Explore Similar Projects

Starred by Ying Sheng (Author of SGLang) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

llm-analysis by cli99

CLI tool for LLM latency/memory analysis during training/inference
0.2% · 441 stars · created 2 years ago · updated 3 months ago
Starred by Bojan Tunguz (AI Scientist; formerly at NVIDIA), Mckay Wrigley (Founder of Takeoff AI), and 8 more.

ggml by ggml-org

Tensor library for machine learning
0.3% · 13k stars · created 2 years ago · updated 3 days ago