llm-analysis by cli99

CLI tool for LLM latency/memory analysis during training/inference

Created 2 years ago
455 stars

Top 66.5% on SourcePulse

View on GitHub
Project Summary

This project provides a Python library for estimating the latency and memory usage of Transformer models during training and inference. It targets researchers and engineers who need to theoretically evaluate different LLM configurations, hardware setups, and parallelism strategies to optimize system performance and cost.

How It Works

The library models latency and memory based on user-defined configurations for model architecture, GPU specifications, data types, and parallelism schemes (Tensor, Pipeline, Sequence, Expert, Data Parallelism). It leverages formulas and equations commonly found in research papers, automating calculations that would otherwise be done manually. The approach allows for theoretical "what-if" analysis to understand the impact of various optimizations like quantization or parallelism.
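
As a concrete illustration of the kind of back-of-envelope arithmetic the library automates, the sketch below estimates parameter count, per-GPU weight memory, and a lower-bound decode latency for a decoder-only Transformer. This is not the library's own API; the formulas are standard approximations, and every number (model shape, GPU peak FLOPS and bandwidth, efficiency factors) is an assumption chosen for illustration.

    # Back-of-envelope estimate of the kind llm-analysis automates.
    # Not the library's API; all numbers below are illustrative assumptions.
    def estimate(num_layers=32, hidden_dim=4096, vocab_size=32000,
                 bytes_per_param=2,      # FP16/BF16 weights
                 tp_size=1,              # tensor-parallel degree
                 peak_tflops=312,        # assumed per-GPU dense FP16 peak
                 hbm_gb_per_s=2039,      # assumed HBM bandwidth (GB/s)
                 flops_efficiency=0.5,
                 hbm_efficiency=0.9):
        # Decoder-only Transformer: ~12*d^2 weights per layer
        # (attention + 4x-expansion MLP) plus the embedding table.
        params = num_layers * 12 * hidden_dim ** 2 + vocab_size * hidden_dim

        # Weight memory per GPU when weights are sharded across tp_size GPUs.
        weight_bytes_per_gpu = params * bytes_per_param / tp_size

        # The forward pass costs roughly 2 FLOPs per parameter per token.
        flops_per_token = 2 * params

        # Lower-bound per-token decode latency: the slower of compute-bound
        # and memory-bound time, assuming perfect overlap (as the summary notes).
        compute_s = flops_per_token / tp_size / (peak_tflops * 1e12 * flops_efficiency)
        memory_s = weight_bytes_per_gpu / (hbm_gb_per_s * 1e9 * hbm_efficiency)
        return params, weight_bytes_per_gpu / 1e9, max(compute_s, memory_s) * 1e3

    params, mem_gb, ms_per_token = estimate()
    print(f"~{params / 1e9:.1f}B params, {mem_gb:.1f} GB weights/GPU, "
          f"~{ms_per_token:.2f} ms/token lower bound")

Changing tp_size in the sketch shrinks both the per-GPU weight memory and the compute time; this is exactly the kind of "what-if" question the tool answers across far more detailed configurations.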

Quick Start & Requirements

  • Install via pip: pip install llm-analysis
  • Install from source: pip install . or poetry install
  • Supports Hugging Face model names (requires an up-to-date transformers library); the sketch after this list shows the kind of architecture fields such a lookup provides.
  • Official documentation and examples are available.
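
Regarding the Hugging Face model names bullet, the point is that a model's architecture can be resolved from its name rather than typed in by hand. A minimal sketch of what such a lookup yields, using the transformers library directly (the model name is only an example, and llm-analysis's own interface for this may differ):

    # Illustrative only: pull the architecture fields an analysis needs
    # from a Hugging Face model name via transformers' AutoConfig.
    from transformers import AutoConfig

    cfg = AutoConfig.from_pretrained("facebook/opt-1.3b")  # example model name
    print(cfg.num_hidden_layers, cfg.hidden_size, cfg.num_attention_heads)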

Highlighted Details

  • Supports Tensor Parallelism, Pipeline Parallelism, Sequence Parallelism, Expert Parallelism, and Data Parallelism (including DeepSpeed ZeRO and FSDP).
  • Models various activation recomputation strategies for memory optimization.
  • Supports data types from FP32 down to INT4.
  • Can estimate costs in GPU-hours for training and inference (see the sketch after this list for the kind of arithmetic involved).
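
To make the GPU-hours bullet concrete, here is the kind of rough cost arithmetic involved, using the common 6 × parameters × tokens approximation for training FLOPs. All values are assumptions for illustration, not library defaults or outputs.

    # Rough training-cost arithmetic of the kind the tool automates.
    # The 6 * params * tokens rule of thumb and the GPU numbers below
    # are assumptions for illustration, not library defaults.
    params = 7e9              # model size (assumed)
    tokens = 1e12             # training tokens (assumed)
    peak_tflops = 312         # per-GPU dense FP16 peak (assumed)
    flops_efficiency = 0.4    # achieved fraction of peak (assumed)

    train_flops = 6 * params * tokens
    gpu_hours = train_flops / (peak_tflops * 1e12 * flops_efficiency) / 3600
    print(f"~{gpu_hours:,.0f} GPU-hours")   # roughly 93,000 with these numbers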

Maintenance & Community

  • Contributions and feedback are welcome.
  • Uses pre-commit for code formatting.
  • Links to relevant research papers and projects (Megatron-LM, DeepSpeed, FasterTransformer) are provided.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README.

Limitations & Caveats

  • Provides lower-bound estimates, assuming perfect overlap of compute and memory operations and maximum memory reuse during inference.
  • Communication costs for Data Parallelism and Pipeline Parallelism are currently ignored or simplified, so predicted performance can be optimistic.
  • Parameter-efficient fine-tuning (PEFT) methods are not yet supported.
  • FP8 datatype support and CPU offloading analysis are planned features.
Health Check

  • Last Commit: 5 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 10 stars in the last 30 days

Explore Similar Projects

Starred by Luca Soldaini (Research Scientist at Ai2), Edward Sun (Research Scientist at Meta Superintelligence Lab), and 4 more.

parallelformers by tunib-ai

  • Toolkit for easy model parallelization
  • 790 stars
  • Created 4 years ago; updated 2 years ago

Starred by Tobi Lutke (Cofounder of Shopify), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 6 more.

xTuring by stochasticai

  • SDK for fine-tuning and customizing open-source LLMs
  • 3k stars
  • Created 2 years ago; updated 1 day ago

Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems").

airllm by lyogavin

  • Inference optimization for LLMs on low-resource hardware
  • 6k stars
  • Created 2 years ago; updated 2 weeks ago