gpu_poor by RahulSChand

CLI tool for LLM memory and throughput estimation

Created 2 years ago
1,399 stars

Top 28.5% on SourcePulse

1 Expert Loves This Project
Project Summary

This project provides a tool that estimates GPU memory requirements and token/s throughput for a given Large Language Model (LLM). It targets users who want to know whether a model will run on their hardware, and it helps them judge whether a model fits, which quantization strategy to use, and whether fine-tuning is feasible.

How It Works

The tool calculates memory usage by summing model size, KV cache, activation memory, gradient/optimizer memory, and CUDA overhead. It estimates token/s based on these memory constraints and compute capabilities. The approach accounts for various quantization methods (GGML, bitsandbytes, QLoRA) and inference frameworks (vLLM, llama.cpp, Hugging Face), providing a breakdown of memory allocation for training and inference scenarios.
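The summation described above can be sketched in a few lines. This is a minimal illustration and not the project's actual code: the `ModelConfig` fields, the function names, the 0.5 GiB overhead default, and the formulas (full multi-head attention for the KV cache, fp32 gradients plus two fp32 Adam moments for training state) are all assumptions chosen for clarity.

```python
from dataclasses import dataclass

GIB = 1024 ** 3  # bytes per GiB


@dataclass
class ModelConfig:
    """Hypothetical container; field names are illustrative, not gpu_poor's."""
    n_params: float   # total parameter count
    n_layers: int     # number of transformer layers
    hidden_size: int  # model (embedding) dimension


def model_bytes(cfg, bits_per_weight):
    """Weights only: parameter count times bits per weight, in bytes."""
    return cfg.n_params * bits_per_weight / 8


def kv_cache_bytes(cfg, context_len, batch_size=1, bytes_per_value=2):
    """One K and one V tensor per layer.

    Assumes full multi-head attention (no GQA/MQA) and 2-byte cache values.
    """
    return (2 * cfg.n_layers * cfg.hidden_size
            * context_len * batch_size * bytes_per_value)


def training_state_bytes(n_trainable, grad_bytes=4, optimizer_bytes=8):
    """Gradients plus optimizer state per trainable parameter.

    Defaults assume fp32 gradients and Adam's two fp32 moment buffers.
    """
    return n_trainable * (grad_bytes + optimizer_bytes)


def memory_breakdown_gib(cfg, bits_per_weight, context_len,
                         activation_bytes=0.0, n_trainable=0.0,
                         overhead_bytes=0.5 * GIB):
    """Sum the components and report each one in GiB.

    activation_bytes is supplied by the caller because it depends heavily on
    the framework; overhead_bytes is an assumed placeholder for CUDA context.
    """
    parts = {
        "model": model_bytes(cfg, bits_per_weight),
        "kv_cache": kv_cache_bytes(cfg, context_len),
        "activations": activation_bytes,
        "grad_optimizer": training_state_bytes(n_trainable),
        "cuda_overhead": overhead_bytes,
    }
    parts["total"] = sum(parts.values())
    return {name: size / GIB for name, size in parts.items()}
```

For a hypothetical 7B-style configuration (`ModelConfig(7e9, 32, 4096)`) at 16 bits per weight and a 2048-token context, the KV cache term comes to exactly 1 GiB and the weights to roughly 13 GiB. Real models with grouped-query attention or different cache dtypes will differ.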

Quick Start & Requirements

Highlighted Details

  • Supports GGML, bitsandbytes, and QLoRA quantization.
  • Estimates token/s, prompt processing time, and fine-tuning iteration time.
  • Breaks down GPU memory usage into categories like KV Cache, Model Size, and Activations.
  • Aims for accuracy within 500MB for memory estimations.
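A common rule of thumb for token/s and prompt-processing time is a roofline-style bound. The sketch below assumes that each generated token has to read all weights and the KV cache from GPU memory once, and that a forward pass costs about 2 FLOPs per parameter per token. The function names and these formulas are assumptions for illustration only; the project may compute its estimates differently.

```python
def decode_tokens_per_second(weight_bytes, kv_bytes, mem_bandwidth, peak_flops, n_params):
    """Rough upper bound on single-stream generation speed.

    mem_bandwidth is in bytes/s and peak_flops in FLOP/s. The slower of the
    memory-bound and compute-bound times per token sets the rate.
    """
    memory_time = (weight_bytes + kv_bytes) / mem_bandwidth
    compute_time = 2 * n_params / peak_flops
    return 1.0 / max(memory_time, compute_time)


def prompt_seconds(n_params, prompt_tokens, peak_flops):
    """Rough lower bound on prompt processing time, assumed compute-bound."""
    return 2 * n_params * prompt_tokens / peak_flops
```

With hypothetical hardware figures of 1e12 bytes/s bandwidth and 1e14 FLOP/s, a 7e9-parameter model stored in 14e9 bytes is memory-bound during decoding, so the bandwidth term dominates the estimate. Measured throughput is usually lower, since kernels rarely reach peak bandwidth or peak FLOPs.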

Maintenance & Community

  • Maintained by RahulSChand; the Health Check below reports the last commit as 1 year ago and responsiveness as inactive.
  • Issue tracking available on GitHub.

Licensing & Compatibility

  • License: MIT.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

Estimation accuracy varies with the specific model configuration, CUDA version, and quantization implementation. vLLM token/s support and AWQ quantization are listed as pending features.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 6 stars in the last 30 days

Explore Similar Projects

Starred by George Hotz (Author of tinygrad; Founder of the tiny corp, comma.ai), Luis Capelo (Cofounder of Lightning AI), and 2 more.

TileKernels by deepseek-ai

1.2%
2k
Optimized GPU kernels for LLM operations
Created 1 month ago
Updated 1 month ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).

fastllm by ztxz16

0.8%
5k
High-performance C++ LLM inference library
Created 3 years ago
Updated 13 hours ago