LLMServingSim by casys-kaist

Unified simulator for heterogeneous LLM serving infrastructure

Created 1 year ago
254 stars

Top 99.0% on SourcePulse

Project Summary

LLMServingSim 2.0 is a unified simulator for heterogeneous and disaggregated LLM serving infrastructure. It empowers engineers and researchers to accurately predict and analyze LLM serving performance across diverse hardware and parallelism strategies, aiding in infrastructure design and optimization.

How It Works

The simulator integrates a vLLM-based layerwise profiler that captures real CUDA kernel timings. This performance data feeds a core engine that models heterogeneous and disaggregated LLM serving. Key features include skew-aware attention for heterogeneous decode batches, multi-hardware support, and per-rank Mixture-of-Experts (MoE) latency modeling with data-parallel plus expert-parallel (DP+EP) execution simulated via ASTRA-Sim. It also supports vLLM-style request routing.
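
The skew-aware attention idea can be illustrated with a toy cost model. This is a hypothetical sketch, not LLMServingSim's actual code: the `us_per_token` constant and the max-over-batch assumption are illustrative stand-ins for the simulator's profiled kernel timings.

```python
# Toy illustration of why decode-batch skew matters (not LLMServingSim code).
from dataclasses import dataclass

@dataclass
class Request:
    context_len: int  # tokens of KV cache read per decode step

def naive_batch_latency(batch, us_per_token=0.5):
    # Skew-oblivious: assumes attention cost tracks the *mean* context length.
    mean_len = sum(r.context_len for r in batch) / len(batch)
    return mean_len * us_per_token

def skew_aware_batch_latency(batch, us_per_token=0.5):
    # Skew-aware: the batched kernel runs until its longest-context request
    # finishes, so latency tracks the *max* context length in the batch.
    return max(r.context_len for r in batch) * us_per_token

# A heterogeneous decode batch: two short contexts plus one long one.
batch = [Request(128), Request(128), Request(8192)]
print(naive_batch_latency(batch))       # underestimates the step time
print(skew_aware_batch_latency(batch))  # dominated by the 8192-token request
```

For a uniform batch the two models agree; the gap appears only under skew, which is exactly the heterogeneous-decode case the simulator targets.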

Quick Start & Requirements

Installation uses Docker (scripts/docker-sim.sh, scripts/docker-vllm.sh) or a bare-metal vLLM installer (scripts/install-vllm.sh); ASTRA-Sim is compiled with ./scripts/compile.sh. Cluster configurations (configs/cluster/) define topology, hardware, memory, and interconnects, and support per-layer placement and PIM. Supported hardware includes the RTXPRO6000, with profiling data provided for Llama-3.1-8B and Qwen3 variants. Workloads are JSONL request traces (workloads/).
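
A plausible setup flow, assembled only from the scripts and directories named above (exact flags and the final run entrypoint are not specified in this summary, so none are shown):

```shell
# Docker path: build the simulator and profiler environments.
./scripts/docker-sim.sh      # simulator container
./scripts/docker-vllm.sh     # vLLM container for the layerwise profiler

# Alternative bare-metal path for vLLM:
./scripts/install-vllm.sh

# Compile the ASTRA-Sim backend.
./scripts/compile.sh

# Cluster definitions (topology, hardware, memory, interconnects):
ls configs/cluster/
# JSONL request traces:
ls workloads/
```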

Highlighted Details

  • vLLM-based layerwise profiler capturing real CUDA kernel timings.
  • Skew-aware attention and multi-hardware support (e.g., RTXPRO6000).
  • Advanced MoE latency modeling with DP+EP parallelism and ASTRA-Sim integration.
  • Agentic session support for complex, closed-loop workloads.
  • FP8 KV cache simulation for reduced memory footprint.
  • Chunked prefill enabled by default.
  • End-to-end validation suite comparing simulator against vLLM.
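
To give a rough sense of what an FP8 KV cache saves, the per-token KV footprint follows directly from model shape. This back-of-the-envelope sketch uses Llama-3.1-8B's published shape (32 layers, 8 KV heads, head dim 128) and is independent of the simulator's internals:

```python
# Back-of-the-envelope KV cache size; defaults match Llama-3.1-8B's shape.
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    # K and V are each stored per layer, per KV head, per head dimension.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return seq_len * per_token

seq = 8192
fp16 = kv_cache_bytes(seq, bytes_per_elem=2)  # FP16/BF16 baseline
fp8 = kv_cache_bytes(seq, bytes_per_elem=1)   # FP8 halves the footprint
print(f"FP16: {fp16 / 2**20:.0f} MiB, FP8: {fp8 / 2**20:.0f} MiB")
```

At an 8K context this works out to 1 GiB per sequence in FP16 versus 512 MiB in FP8, which is why the KV cache dtype matters so much for serving capacity.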

Maintenance & Community

The project is under active development, and the current branch is noted as potentially unstable. Contributions via pull requests are welcome. Accompanying papers were published in CAL 2025 and IISWC 2024, indicating academic relevance.

Licensing & Compatibility

The README does not explicitly state a software license; clarify licensing before commercial use or integration into closed-source projects.

Limitations & Caveats

The current development branch is marked as potentially unstable. The absence of a stated software license is a significant caveat for adoption, particularly in commercial contexts.

Health Check

  • Last Commit: 3 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 2
  • Star History: 21 stars in the last 30 days
