CLI tool for LLM memory and throughput estimation
This project provides a tool to estimate the GPU memory requirements and token/s throughput of any Large Language Model (LLM), aimed at users who want to know whether a given model will run on their hardware. It offers guidance on which models fit, which quantization strategies to use, and whether fine-tuning is feasible.
How It Works
The tool calculates memory usage by summing model size, KV cache, activation memory, gradient/optimizer memory, and CUDA overhead. It estimates token/s based on these memory constraints and compute capabilities. The approach accounts for various quantization methods (GGML, bitsandbytes, QLoRA) and inference frameworks (vLLM, llama.cpp, Hugging Face), providing a breakdown of memory allocation for training and inference scenarios.
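As a rough illustration of this kind of arithmetic, the minimal sketch below estimates inference memory (weights + KV cache + overhead) and a memory-bandwidth-bound token/s figure. The constants, overhead values, and bandwidth efficiency are illustrative assumptions, not the coefficients the tool actually uses.

```python
# Back-of-the-envelope sketch of the estimation idea.
# All constants (bytes per parameter, CUDA overhead, bandwidth efficiency)
# are illustrative assumptions, not gpu_poor's actual coefficients.

def inference_memory_gb(n_params_b: float, bytes_per_param: float,
                        n_layers: int, hidden_size: int, kv_heads_ratio: float,
                        context_len: int, batch_size: int = 1,
                        cuda_overhead_gb: float = 0.75) -> float:
    """Approximate GPU memory (GB) needed for inference."""
    weights_gb = n_params_b * bytes_per_param  # billions of params * bytes each
    # KV cache: K and V tensors per layer, stored in fp16 (2 bytes per element)
    kv_cache_gb = (2 * 2 * n_layers * hidden_size * kv_heads_ratio
                   * context_len * batch_size) / 1e9
    return weights_gb + kv_cache_gb + cuda_overhead_gb


def tokens_per_second(weights_gb: float, bandwidth_gb_s: float,
                      efficiency: float = 0.6) -> float:
    """Memory-bandwidth-bound decode estimate: each generated token
    reads roughly all model weights once from GPU memory."""
    return efficiency * bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    # Example: a 7B model in 4-bit (~0.5 bytes/param), 32 layers, hidden size
    # 4096, 4k context, on a GPU with ~1 TB/s memory bandwidth.
    mem = inference_memory_gb(7, 0.5, n_layers=32, hidden_size=4096,
                              kv_heads_ratio=1.0, context_len=4096)
    tps = tokens_per_second(7 * 0.5, 1000)
    print(f"~{mem:.1f} GB needed, ~{tps:.0f} tok/s")
```

Training estimates extend the same sum with gradient and optimizer state terms, which is why fine-tuning a model typically needs several times the memory of inference on the same weights.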
Quick Start & Requirements
pip install gpu_poor
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Estimation accuracy can vary with specific model configurations, CUDA versions, and quantization implementations. The project is under active development, and features such as vLLM token/s support and AWQ quantization are still pending.