llm-analysis by cli99

CLI tool for LLM latency/memory analysis during training/inference

created 2 years ago
441 stars

Top 68.9% on sourcepulse

View on GitHub
Project Summary

This project provides a Python library for estimating the latency and memory usage of Transformer models during training and inference. It targets researchers and engineers who need to theoretically evaluate different LLM configurations, hardware setups, and parallelism strategies to optimize system performance and cost.

How It Works

The library models latency and memory based on user-defined configurations for model architecture, GPU specifications, data types, and parallelism schemes (Tensor, Pipeline, Sequence, Expert, Data Parallelism). It leverages formulas and equations commonly found in research papers, automating calculations that would otherwise be done manually. The approach allows for theoretical "what-if" analysis to understand the impact of various optimizations like quantization or parallelism.
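
For intuition, the sketch below shows the style of back-of-envelope arithmetic the library automates. It is illustrative only, not llm-analysis code: the model shape, FP16 data type, and hardware numbers are assumed values, and the formulas (reading every weight once per decoded token, roughly 2 FLOPs per parameter per token, K/V cache of 2 × layers × hidden size per token) are the standard first-order approximations from the literature.

```python
# Illustrative back-of-envelope estimates, NOT the llm-analysis implementation.
# Assumed example: a 7B-parameter decoder-only model served in FP16 on a GPU
# with ~2 TB/s memory bandwidth and ~312 TFLOPS of dense FP16 compute.

NUM_PARAMS = 7e9
BYTES_PER_PARAM = 2               # FP16
NUM_LAYERS, HIDDEN_SIZE = 32, 4096
MEM_BANDWIDTH = 2.0e12            # bytes/s
PEAK_FLOPS = 312e12               # FLOPs/s

# Per generated token, decoding reads every weight once (memory-bound view)
# and performs roughly 2 FLOPs per parameter (compute-bound view).
weight_bytes = NUM_PARAMS * BYTES_PER_PARAM
latency_memory_bound = weight_bytes / MEM_BANDWIDTH      # seconds per token
latency_compute_bound = 2 * NUM_PARAMS / PEAK_FLOPS      # seconds per token
latency_per_token = max(latency_memory_bound, latency_compute_bound)  # lower bound

# KV cache: keys and values for every layer, per token in the sequence.
kv_bytes_per_token = 2 * NUM_LAYERS * HIDDEN_SIZE * BYTES_PER_PARAM

print(f"decode latency lower bound: {latency_per_token * 1e3:.2f} ms/token")
print(f"KV cache: ~{kv_bytes_per_token * 1024 / 2**30:.2f} GiB per 1024-token sequence")
```

llm-analysis layers the effects of parallelism schemes, recomputation, and lower-precision data types on top of this kind of first-order reasoning.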

Quick Start & Requirements

  • Install via pip: pip install llm-analysis
  • Install from source: pip install . or poetry install
  • Supports Hugging Face model names (requires an up-to-date transformers library).
  • Official documentation and examples are available.

Highlighted Details

  • Supports Tensor Parallelism, Pipeline Parallelism, Sequence Parallelism, Expert Parallelism, and Data Parallelism (including DeepSpeed ZeRO and FSDP).
  • Models various activation recomputation strategies for memory optimization.
  • Supports data types from FP32 down to INT4.
  • Can estimate costs in GPU-hours for training and inference (see the sketch after this list).
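
The GPU-hours estimate in particular reduces to the same kind of arithmetic. The sketch below is illustrative only, not library code: the ~6·N·D training-FLOPs rule, the utilization figure, and the price per GPU-hour are assumptions.

```python
# Illustrative GPU-hours / cost arithmetic, NOT the llm-analysis implementation.

def training_gpu_hours(num_params, num_tokens, peak_flops_per_gpu, flops_utilization):
    """Rough GPU-hours for one training run, using the common ~6*N*D FLOPs rule."""
    total_flops = 6 * num_params * num_tokens
    gpu_seconds = total_flops / (peak_flops_per_gpu * flops_utilization)
    return gpu_seconds / 3600

# Assumed example: 7B params, 300B tokens, 312 TFLOPS peak per GPU, 40% utilization.
hours = training_gpu_hours(7e9, 300e9, 312e12, 0.40)
print(f"~{hours:,.0f} GPU-hours; at an assumed $2/GPU-hour, roughly ${2 * hours:,.0f}")
```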

Maintenance & Community

  • Contributions and feedback are welcome.
  • Uses pre-commit for code formatting.
  • Links to relevant research papers and projects (Megatron-LM, DeepSpeed, FasterTransformer) are provided.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README.

Limitations & Caveats

  • Provides lower-bound estimations, assuming perfect overlapping of compute/memory operations and maximum memory reuse for inference.
  • Communication costs for Data Parallelism and Pipeline Parallelism are currently ignored or simplified, potentially overestimating performance.
  • Parameter-efficient fine-tuning (PEFT) methods are not yet supported.
  • FP8 datatype support and CPU offloading analysis are planned features.
Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 36 stars in the last 90 days
