gpu_poor by RahulSChand

CLI tool for LLM memory and throughput estimation

Created 2 years ago
1,355 stars

Top 29.7% on SourcePulse

1 Expert Loves This Project
Project Summary

This project provides a tool to estimate the GPU memory requirements and token/s performance of any Large Language Model (LLM), targeting users who want to determine whether a model is compatible with their hardware. It offers insight into whether a given model fits in available VRAM, which quantization strategies would make it fit, and whether fine-tuning is feasible.

How It Works

The tool calculates memory usage by summing model size, KV cache, activation memory, gradient/optimizer memory, and CUDA overhead. It estimates token/s based on these memory constraints and compute capabilities. The approach accounts for various quantization methods (GGML, bitsandbytes, QLoRA) and inference frameworks (vLLM, llama.cpp, Hugging Face), providing a breakdown of memory allocation for training and inference scenarios.
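
To make the accounting concrete, the sketch below mirrors this style of estimate for inference in Python. The component formulas and constants here (bytes per weight, the CUDA overhead, the activation term, the bandwidth figure) are simplified illustrations and assumptions, not gpu_poor's actual implementation.

    # A simplified per-component estimate in the spirit of gpu_poor's breakdown.
    # Constants and the activation term are illustrative assumptions, not the
    # tool's actual formulas.

    from dataclasses import dataclass

    @dataclass
    class ModelConfig:
        n_params_b: float        # parameter count, in billions
        n_layers: int
        n_heads: int
        head_dim: int
        bytes_per_weight: float  # 2.0 for fp16; roughly 0.5-0.6 for 4-bit formats

    def inference_memory_gb(cfg: ModelConfig, context_len: int, batch: int = 1,
                            kv_bytes: float = 2.0,
                            cuda_overhead_gb: float = 0.75) -> dict:
        """Per-component GPU memory estimate for inference, in GB."""
        gib = 1024 ** 3
        weights = cfg.n_params_b * 1e9 * cfg.bytes_per_weight / gib
        # KV cache: a K and a V tensor per layer, each [batch, heads, context, head_dim]
        kv = (2 * cfg.n_layers * cfg.n_heads * cfg.head_dim
              * context_len * batch * kv_bytes) / gib
        # Activations: crude per-token working-set guess, not a per-op model
        act = cfg.n_layers * cfg.n_heads * cfg.head_dim * batch * 4 / gib
        total = weights + kv + act + cuda_overhead_gb
        return {"weights": weights, "kv_cache": kv, "activations": act,
                "cuda_overhead": cuda_overhead_gb, "total": total}

    def decode_tokens_per_sec(weights_gb: float, bandwidth_gb_s: float) -> float:
        """Memory-bound decoding: each new token streams the weights once."""
        return bandwidth_gb_s / weights_gb

    # Example: a 7B fp16 model at 4k context on a ~1 TB/s card (hypothetical numbers)
    cfg = ModelConfig(7.0, 32, 32, 128, 2.0)
    mem = inference_memory_gb(cfg, context_len=4096)
    print({k: round(v, 2) for k, v in mem.items()})
    print(f"~{decode_tokens_per_sec(mem['weights'], 1000):.0f} tok/s upper bound")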

Highlighted Details

  • Supports GGML, bitsandbytes, and QLoRA quantization.
  • Estimates token/s, prompt processing time, and fine-tuning iteration time (see the fine-tuning memory sketch after this list).
  • Breaks down GPU memory usage into categories like KV Cache, Model Size, and Activations.
  • Aims for accuracy within 500MB for memory estimations.
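
The fine-tuning estimates rest on the same bookkeeping: a full fine-tune must hold gradients and optimizer states alongside the weights. Below is a minimal sketch with illustrative byte counts for mixed-precision Adam; the constants are assumptions, not gpu_poor's exact values.

    # Why full fine-tuning needs far more memory than inference: with Adam,
    # each parameter typically carries a gradient plus two optimizer moments.
    # Byte counts below are illustrative assumptions, not gpu_poor's constants.

    def finetune_memory_gb(n_params_b: float,
                           bytes_weights: float = 2.0,  # fp16 weights
                           bytes_grads: float = 2.0,    # fp16 gradients
                           bytes_optim: float = 8.0     # fp32 Adam m and v states
                           ) -> float:
        """Weights + gradients + optimizer states, in GB (activations excluded)."""
        per_param = bytes_weights + bytes_grads + bytes_optim
        return n_params_b * 1e9 * per_param / 1024 ** 3

    # A 7B model comes out to roughly 78 GB before activations, which is why
    # QLoRA-style fine-tuning (4-bit base weights plus small adapters) matters.
    print(f"{finetune_memory_gb(7):.0f} GB")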

Maintenance & Community

  • Maintained by RahulSChand, though recent activity has slowed (see Health Check below).
  • Issue tracking available on GitHub.

Licensing & Compatibility

  • License: MIT.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

The accuracy of the estimates can vary with specific model configurations, CUDA versions, and quantization implementations. Some planned features, such as vLLM token/s support and AWQ quantization, remain pending.

Health Check

  • Last Commit: 9 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 13 stars in the last 30 days

Explore Similar Projects

fastllm by ztxz16
High-performance C++ LLM inference library
Top 0.4% on SourcePulse
4k stars
Created 2 years ago
Updated 1 week ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).

airllm by lyogavin
Inference optimization for LLMs on low-resource hardware
Top 0.1% on SourcePulse
6k stars
Created 2 years ago
Updated 2 weeks ago
Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").