gpu_poor by RahulSChand

CLI tool for LLM memory and throughput estimation

Created 2 years ago
1,382 stars

Top 29.1% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

This project provides a tool to estimate the GPU memory requirements and token/s throughput of any Large Language Model (LLM), for users who want to determine whether a given model is compatible with their hardware. It covers whether a model fits, which quantization strategies apply, and whether fine-tuning is feasible.

How It Works

The tool computes total memory by summing model weights, KV cache, activation memory, gradient/optimizer memory, and CUDA overhead, then estimates token/s from these memory constraints and the hardware's compute capability. It accounts for several quantization methods (GGML, bitsandbytes, QLoRA) and inference frameworks (vLLM, llama.cpp, Hugging Face), and reports a per-category memory breakdown for both training and inference scenarios.
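A minimal sketch of that summation in Python, under simplified assumptions (the function name, per-component formulas, and the CUDA-overhead constant below are illustrative, not the project's actual code):

```python
def estimate_inference_memory_gib(
    n_params_b: float,       # model parameters, in billions
    bytes_per_param: float,  # 2.0 for fp16, ~0.5 for 4-bit quantization
    n_layers: int,
    hidden_size: int,
    context_len: int,
    batch_size: int = 1,
    cuda_overhead_gib: float = 0.65,  # assumed flat CUDA/runtime overhead
) -> float:
    """Sum the memory components named above (all formulas simplified)."""
    gib = 1024 ** 3
    weights = n_params_b * 1e9 * bytes_per_param
    # KV cache: 2 tensors (K and V) per layer, fp16 (2 bytes) per element
    kv_cache = 2 * n_layers * context_len * hidden_size * 2 * batch_size
    # Activations: rough per-token working set, fp16 (heavily simplified)
    activations = batch_size * context_len * hidden_size * 2
    return (weights + kv_cache + activations) / gib + cuda_overhead_gib

# Example: a 7B fp16 model with Llama-2-7B-like shapes at 2048 context
print(f"{estimate_inference_memory_gib(7, 2, 32, 4096, 2048):.1f} GiB")  # ~14.7 GiB
```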

Quick Start & Requirements

Highlighted Details

  • Supports GGML, bitsandbytes, and QLoRA quantization.
  • Estimates token/s, prompt processing time, and fine-tuning iteration time.
  • Breaks down GPU memory usage into categories such as KV Cache, Model Size, and Activations (see the KV-cache sketch after this list).
  • Aims for memory estimates accurate to within 500 MB.
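For a sense of scale, the KV-cache category alone follows a standard transformer formula. A worked sketch, using Llama-2-7B-like shapes as assumptions (the numbers are illustrative, not drawn from this project):

```python
# KV cache bytes = 2 (K and V) x layers x tokens x heads x head_dim x bytes/element
n_layers, n_heads, head_dim = 32, 32, 128  # Llama-2-7B-like shapes (assumption)
tokens, bytes_per_elem = 4096, 2           # 4k-token context, fp16 cache
kv_bytes = 2 * n_layers * tokens * n_heads * head_dim * bytes_per_elem
print(f"KV cache: {kv_bytes / 1024**3:.2f} GiB")  # -> KV cache: 2.00 GiB
```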

Maintenance & Community

  • Maintained by RahulSChand.
  • Issue tracking available on GitHub.

Licensing & Compatibility

  • License: MIT.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

Estimation accuracy can vary with specific model configurations, CUDA versions, and quantization implementations. Some features, such as vLLM token/s support and AWQ quantization, are still listed as pending.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

rtp-llm by alibaba

0.7% · 995 stars
LLM inference engine for diverse applications
Created 2 years ago · Updated 14 hours ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).

fastllm by ztxz16

0.1% · 4k stars
High-performance C++ LLM inference library
Created 2 years ago · Updated 1 month ago
Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin

4.6% · 7k stars
Inference optimization for LLMs on low-resource hardware
Created 2 years ago · Updated 4 months ago