CLI tool for LLM latency/memory analysis during training/inference
This project provides a Python library for estimating the latency and memory usage of Transformer models during training and inference. It targets researchers and engineers who need to evaluate different LLM configurations, hardware setups, and parallelism strategies theoretically in order to optimize system performance and cost.
How It Works
The library models latency and memory based on user-defined configurations for model architecture, GPU specifications, data types, and parallelism schemes (Tensor, Pipeline, Sequence, Expert, Data Parallelism). It leverages formulas and equations commonly found in research papers, automating calculations that would otherwise be done manually. The approach allows for theoretical "what-if" analysis to understand the impact of various optimizations like quantization or parallelism.
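To make the kind of arithmetic the library automates concrete, here is a minimal hand-rolled sketch (not the library's API) of the standard back-of-envelope estimates for a dense decoder-only Transformer. The 12·l·h² parameter count and 2 FLOPs per parameter per token are common approximations; they ignore embeddings, activation memory, and attention variants:

```python
def estimate_dense_transformer(num_layers: int, hidden_size: int,
                               bytes_per_param: int = 2) -> dict:
    """Back-of-envelope estimates for a dense decoder-only Transformer."""
    # ~12*h^2 weights per layer (attention projections + MLP).
    params = 12 * num_layers * hidden_size ** 2
    return {
        "params": params,
        # ~2 FLOPs per parameter per token for the forward pass.
        "flops_per_token": 2 * params,
        # Weights only; KV cache and activations are extra.
        "inference_weight_bytes": params * bytes_per_param,
        # Mixed-precision training state (fp16 weights/grads + fp32
        # Adam moments) is roughly 16 bytes per parameter.
        "train_state_bytes": params * 16,
    }

# Example: a GPT-3-like shape (96 layers, hidden size 12288) lands
# near 174B parameters and ~324 GiB of fp16 weights.
est = estimate_dense_transformer(96, 12288)
print(f"{est['params'] / 1e9:.0f}B params, "
      f"{est['inference_weight_bytes'] / 2**30:.0f} GiB weights at fp16")
```

The library automates this kind of calculation and extends it with parallelism degrees, quantized data types, and per-GPU latency and memory breakdowns.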
Quick Start & Requirements
Install from PyPI:

```sh
pip install llm-analysis
```

Or install from source:

```sh
pip install .
# or
poetry install
```
Fetching model configurations by name from the Hugging Face Hub requires the `transformers` library.
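As an illustration of what that dependency is used for, the following sketch pulls architecture hyperparameters by model name; the model name and printed attributes are just examples, not llm-analysis code:

```python
from transformers import AutoConfig

# Download the architecture config for a model hosted on Hugging Face.
config = AutoConfig.from_pretrained("gpt2")

# Hyperparameters like these feed the latency/memory formulas.
print(config.n_layer, config.n_head, config.n_embd)
```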
Highlighted Details
Maintenance & Community
The project uses `pre-commit` hooks for code formatting.

Licensing & Compatibility
Limitations & Caveats