llm-analysis by cli99

CLI tool for LLM latency/memory analysis during training/inference

created 2 years ago
441 stars

Top 68.9% on sourcepulse

View on GitHub
Project Summary

This project provides a Python library for estimating the latency and memory usage of Transformer models during training and inference. It targets researchers and engineers who need to theoretically evaluate different LLM configurations, hardware setups, and parallelism strategies to optimize system performance and cost.

How It Works

The library models latency and memory based on user-defined configurations for model architecture, GPU specifications, data types, and parallelism schemes (Tensor, Pipeline, Sequence, Expert, Data Parallelism). It leverages formulas and equations commonly found in research papers, automating calculations that would otherwise be done manually. The approach allows for theoretical "what-if" analysis to understand the impact of various optimizations like quantization or parallelism.
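
For intuition, the sketch below shows the style of back-of-envelope arithmetic the library automates. It is illustrative only, not llm-analysis code: the model shape, FP16 data type, and hardware numbers are assumed values, and the formulas (reading every weight once per decoded token, roughly 2 FLOPs per parameter per token, K/V cache of 2 × layers × hidden size per token) are the standard first-order approximations from the literature.

```python
# Illustrative back-of-envelope estimates, NOT the llm-analysis implementation.
# Assumed example: a 7B-parameter decoder-only model served in FP16 on a GPU
# with ~2 TB/s memory bandwidth and ~312 TFLOPS of dense FP16 compute.

NUM_PARAMS = 7e9
BYTES_PER_PARAM = 2               # FP16
NUM_LAYERS, HIDDEN_SIZE = 32, 4096
MEM_BANDWIDTH = 2.0e12            # bytes/s
PEAK_FLOPS = 312e12               # FLOPs/s

# Per generated token, decoding reads every weight once (memory-bound view)
# and performs roughly 2 FLOPs per parameter (compute-bound view).
weight_bytes = NUM_PARAMS * BYTES_PER_PARAM
latency_memory_bound = weight_bytes / MEM_BANDWIDTH      # seconds per token
latency_compute_bound = 2 * NUM_PARAMS / PEAK_FLOPS      # seconds per token
latency_per_token = max(latency_memory_bound, latency_compute_bound)  # lower bound

# KV cache: keys and values for every layer, per token in the sequence.
kv_bytes_per_token = 2 * NUM_LAYERS * HIDDEN_SIZE * BYTES_PER_PARAM

print(f"decode latency lower bound: {latency_per_token * 1e3:.2f} ms/token")
print(f"KV cache: ~{kv_bytes_per_token * 1024 / 2**30:.2f} GiB per 1024-token sequence")
```

llm-analysis layers the effects of parallelism schemes, recomputation, and lower-precision data types on top of this kind of first-order reasoning.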

Quick Start & Requirements

  • Install via pip: pip install llm-analysis
  • Install from source: pip install . or poetry install
  • Supports Hugging Face model names (requires an up-to-date transformers library).
  • Official documentation and examples are available.

Highlighted Details

  • Supports Tensor Parallelism, Pipeline Parallelism, Sequence Parallelism, Expert Parallelism, and Data Parallelism (including DeepSpeed ZeRO and FSDP).
  • Models various activation recomputation strategies for memory optimization.
  • Supports data types from FP32 down to INT4.
  • Can estimate costs in GPU-hours for training and inference (see the sketch after this list).
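
The GPU-hours estimate in particular reduces to the same kind of arithmetic. The sketch below is illustrative only, not library code: the ~6·N·D training-FLOPs rule, the utilization figure, and the price per GPU-hour are assumptions.

```python
# Illustrative GPU-hours / cost arithmetic, NOT the llm-analysis implementation.

def training_gpu_hours(num_params, num_tokens, peak_flops_per_gpu, flops_utilization):
    """Rough GPU-hours for one training run, using the common ~6*N*D FLOPs rule."""
    total_flops = 6 * num_params * num_tokens
    gpu_seconds = total_flops / (peak_flops_per_gpu * flops_utilization)
    return gpu_seconds / 3600

# Assumed example: 7B params, 300B tokens, 312 TFLOPS peak per GPU, 40% utilization.
hours = training_gpu_hours(7e9, 300e9, 312e12, 0.40)
print(f"~{hours:,.0f} GPU-hours; at an assumed $2/GPU-hour, roughly ${2 * hours:,.0f}")
```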

Maintenance & Community

  • Contributions and feedback are welcome.
  • Uses pre-commit for code formatting.
  • Links to relevant research papers and projects (Megatron-LM, DeepSpeed, FasterTransformer) are provided.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README.

Limitations & Caveats

  • Provides lower-bound estimations, assuming perfect overlapping of compute/memory operations and maximum memory reuse for inference.
  • Communication costs for Data Parallelism and Pipeline Parallelism are currently ignored or simplified, potentially overestimating performance.
  • Parameter-efficient fine-tuning (PEFT) methods are not yet supported.
  • FP8 datatype support and CPU offloading analysis are planned features.
Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 36 stars in the last 90 days
