CLI tool for LLM memory and throughput estimation
This project provides a tool to estimate the GPU memory requirements and token/s throughput of any Large Language Model (LLM), aimed at users who want to know whether a given model will run on their hardware. It offers guidance on which models fit, which quantization strategies to use, and whether fine-tuning is feasible.
How It Works
The tool calculates memory usage by summing model size, KV cache, activation memory, gradient/optimizer memory, and CUDA overhead. It estimates token/s based on these memory constraints and compute capabilities. The approach accounts for various quantization methods (GGML, bitsandbytes, QLoRA) and inference frameworks (vLLM, llama.cpp, Hugging Face), providing a breakdown of memory allocation for training and inference scenarios.
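As a rough illustration of this kind of arithmetic, the minimal sketch below estimates inference memory (weights + KV cache + overhead) and a memory-bandwidth-bound token/s figure. The constants, overhead values, and bandwidth efficiency are illustrative assumptions, not the coefficients the tool actually uses.

```python
# Back-of-the-envelope sketch of the estimation idea.
# All constants (bytes per parameter, CUDA overhead, bandwidth efficiency)
# are illustrative assumptions, not gpu_poor's actual coefficients.

def inference_memory_gb(n_params_b: float, bytes_per_param: float,
                        n_layers: int, hidden_size: int, kv_heads_ratio: float,
                        context_len: int, batch_size: int = 1,
                        cuda_overhead_gb: float = 0.75) -> float:
    """Approximate GPU memory (GB) needed for inference."""
    weights_gb = n_params_b * bytes_per_param  # billions of params * bytes each
    # KV cache: K and V tensors per layer, stored in fp16 (2 bytes per element)
    kv_cache_gb = (2 * 2 * n_layers * hidden_size * kv_heads_ratio
                   * context_len * batch_size) / 1e9
    return weights_gb + kv_cache_gb + cuda_overhead_gb


def tokens_per_second(weights_gb: float, bandwidth_gb_s: float,
                      efficiency: float = 0.6) -> float:
    """Memory-bandwidth-bound decode estimate: each generated token
    reads roughly all model weights once from GPU memory."""
    return efficiency * bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    # Example: a 7B model in 4-bit (~0.5 bytes/param), 32 layers, hidden size
    # 4096, 4k context, on a GPU with ~1 TB/s memory bandwidth.
    mem = inference_memory_gb(7, 0.5, n_layers=32, hidden_size=4096,
                              kv_heads_ratio=1.0, context_len=4096)
    tps = tokens_per_second(7 * 0.5, 1000)
    print(f"~{mem:.1f} GB needed, ~{tps:.0f} tok/s")
```

Training estimates extend the same sum with gradient and optimizer state terms, which is why fine-tuning a model typically needs several times the memory of inference on the same weights.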
Quick Start & Requirements
pip install gpu_poor
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Estimation accuracy can vary with specific model configurations, CUDA versions, and quantization implementations. The project is under active development, and features such as vLLM token/s support and AWQ quantization are still pending.