LLMServingSim by casys-kaist

Unified simulator for heterogeneous LLM serving infrastructure

Created 1 year ago
254 stars

Top 99.0% on SourcePulse

Project Summary

LLMServingSim 2.0 is a unified simulator for heterogeneous and disaggregated LLM serving infrastructure. It empowers engineers and researchers to accurately predict and analyze LLM serving performance across diverse hardware and parallelism strategies, aiding in infrastructure design and optimization.

How It Works

The simulator integrates a vLLM-based layerwise profiler that captures real CUDA kernel timings. This performance data feeds a core engine that models heterogeneous and disaggregated LLM serving. Key features include skew-aware attention for heterogeneous decode batches, multi-hardware support, and per-rank Mixture-of-Experts (MoE) latency modeling with data-parallel plus expert-parallel (DP+EP) execution simulated via ASTRA-Sim. It also supports vLLM-style request routing.
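
The skew-aware attention idea can be illustrated with a toy cost model. This is a hypothetical sketch, not LLMServingSim's actual code: the `us_per_token` constant and the max-over-batch assumption are illustrative stand-ins for the simulator's profiled kernel timings.

```python
# Toy illustration of why decode-batch skew matters (not LLMServingSim code).
from dataclasses import dataclass

@dataclass
class Request:
    context_len: int  # tokens of KV cache read per decode step

def naive_batch_latency(batch, us_per_token=0.5):
    # Skew-oblivious: assumes attention cost tracks the *mean* context length.
    mean_len = sum(r.context_len for r in batch) / len(batch)
    return mean_len * us_per_token

def skew_aware_batch_latency(batch, us_per_token=0.5):
    # Skew-aware: the batched kernel runs until its longest-context request
    # finishes, so latency tracks the *max* context length in the batch.
    return max(r.context_len for r in batch) * us_per_token

# A heterogeneous decode batch: two short contexts plus one long one.
batch = [Request(128), Request(128), Request(8192)]
print(naive_batch_latency(batch))       # underestimates the step time
print(skew_aware_batch_latency(batch))  # dominated by the 8192-token request
```

For a uniform batch the two models agree; the gap appears only under skew, which is exactly the heterogeneous-decode case the simulator targets.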

Quick Start & Requirements

Installation uses Docker (scripts/docker-sim.sh, scripts/docker-vllm.sh) or a bare-metal vLLM installer (scripts/install-vllm.sh); ASTRA-Sim is compiled with ./scripts/compile.sh. Cluster configurations (configs/cluster/) define topology, hardware, memory, and interconnects, and support per-layer placement and PIM. Supported hardware includes the RTXPRO6000, with profiling data provided for Llama-3.1-8B and Qwen3 variants. Workloads are JSONL request traces (workloads/).
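
A plausible setup flow, assembled only from the scripts and directories named above (exact flags and the final run entrypoint are not specified in this summary, so none are shown):

```shell
# Docker path: build the simulator and profiler environments.
./scripts/docker-sim.sh      # simulator container
./scripts/docker-vllm.sh     # vLLM container for the layerwise profiler

# Alternative bare-metal path for vLLM:
./scripts/install-vllm.sh

# Compile the ASTRA-Sim backend.
./scripts/compile.sh

# Cluster definitions (topology, hardware, memory, interconnects):
ls configs/cluster/
# JSONL request traces:
ls workloads/
```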

Highlighted Details

  • vLLM-based layerwise profiler capturing real CUDA kernel timings.
  • Skew-aware attention and multi-hardware support (e.g., RTXPRO6000).
  • Advanced MoE latency modeling with DP+EP parallelism and ASTRA-Sim integration.
  • Agentic session support for complex, closed-loop workloads.
  • FP8 KV cache simulation for reduced memory footprint.
  • Chunked prefill enabled by default.
  • End-to-end validation suite comparing simulator against vLLM.
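
To give a rough sense of what an FP8 KV cache saves, the per-token KV footprint follows directly from model shape. This back-of-the-envelope sketch uses Llama-3.1-8B's published shape (32 layers, 8 KV heads, head dim 128) and is independent of the simulator's internals:

```python
# Back-of-the-envelope KV cache size; defaults match Llama-3.1-8B's shape.
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    # K and V are each stored per layer, per KV head, per head dimension.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return seq_len * per_token

seq = 8192
fp16 = kv_cache_bytes(seq, bytes_per_elem=2)  # FP16/BF16 baseline
fp8 = kv_cache_bytes(seq, bytes_per_elem=1)   # FP8 halves the footprint
print(f"FP16: {fp16 / 2**20:.0f} MiB, FP8: {fp8 / 2**20:.0f} MiB")
```

At an 8K context this works out to 1 GiB per sequence in FP16 versus 512 MiB in FP8, which is why the KV cache dtype matters so much for serving capacity.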

Maintenance & Community

The project is under active development, and the current branch is noted as potentially unstable. Contributions via pull requests are welcome. Accompanying papers were published in CAL 2025 and IISWC 2024, indicating academic relevance.

Licensing & Compatibility

The README does not explicitly state a software license; clarify licensing before commercial use or integration into closed-source projects.

Limitations & Caveats

The current development branch is marked as potentially unstable. The absence of a stated software license is a significant caveat for adoption, particularly in commercial contexts.

Health Check

  • Last Commit: 3 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 2
  • Star History: 21 stars in the last 30 days
