vidur by Microsoft

LLM inference system simulator

created 1 year ago
413 stars

Top 71.9% on sourcepulse

Project Summary

Vidur is a high-fidelity LLM inference system simulator designed for researchers and engineers. It enables detailed performance analysis, capacity planning, and rapid prototyping of new scheduling algorithms and optimizations without requiring direct GPU access for most testing.

How It Works

Vidur simulates LLM inference by modeling request arrival, scheduling, execution, and resource utilization. It supports both recorded workload traces and synthetic request generation, and reports metrics such as Time To First Token (TTFT) and Total Request Time. Its extensible design lets new scheduling algorithms and optimization techniques, such as speculative decoding, be integrated, making it a flexible platform for system-level LLM research.
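
As a concrete picture of this style of simulation, here is a minimal, self-contained sketch (all names are hypothetical; this is not Vidur's actual API): synthetic Poisson arrivals are scheduled FCFS onto a single replica with fixed prefill and per-token decode costs, and the run reports mean and p99 TTFT alongside mean Total Request Time.

    import random

    def simulate(num_requests=100, arrival_rate=0.5, prefill_s=0.5,
                 decode_s=0.02, output_tokens=64, seed=0):
        """FCFS simulation of one replica serving LLM requests."""
        rng = random.Random(seed)
        # Synthetic workload: Poisson arrivals via exponential gaps.
        arrivals, t = [], 0.0
        for _ in range(num_requests):
            t += rng.expovariate(arrival_rate)
            arrivals.append(t)

        free_at = 0.0  # time at which the replica next becomes idle
        ttfts, totals = [], []
        for arrival in arrivals:
            start = max(arrival, free_at)        # queue if replica is busy
            first_token = start + prefill_s      # prefill emits token 1
            finish = first_token + decode_s * (output_tokens - 1)
            free_at = finish
            ttfts.append(first_token - arrival)  # Time To First Token
            totals.append(finish - arrival)      # Total Request Time

        ttfts.sort()
        p99 = ttfts[int(0.99 * (len(ttfts) - 1))]
        print(f"mean TTFT {sum(ttfts)/len(ttfts):.3f}s | p99 TTFT {p99:.3f}s"
              f" | mean total {sum(totals)/len(totals):.3f}s")

    simulate()

A real simulator replaces the fixed prefill and decode constants with a learned execution time predictor and the single replica with a full cluster model, but the event loop structure is the same.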

Quick Start & Requirements

  • Install: Create a mamba environment using mamba env create -p ./env -f ./environment.yml or a venv environment with python -m pip install -r requirements.txt.
  • Prerequisites: Python 3.10+ recommended. Optional wandb integration for logging.
  • Resources: Requires significant disk space for traces and simulation outputs. GPU access is only needed for initial profiling.
  • Docs: MLSys'24 paper and talk
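  • Run: per the project README, python -m vidur.main launches a simulation from the repository root; python -m vidur.main -h lists the available configuration options (exact flags may vary across versions).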

Highlighted Details

  • Supports popular models like Llama-3, Llama-2, CodeLlama, InternLM, and Qwen.
  • Models tensor and pipeline parallelism configurations across various NVIDIA GPU architectures (A100, H100).
  • Outputs detailed simulation metrics and Chrome traces for in-depth analysis.
  • Extensible architecture for adding new models, SKUs, and scheduling algorithms (see the sketch after this list).
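
To illustrate the kind of extension point the last bullet refers to, here is a hedged sketch under an invented interface (Vidur's real scheduler classes live in its source tree and will differ): a policy only has to decide which queued request runs next, so replacing FCFS with shortest-job-first is a single small class.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Request:
        arrival: float        # arrival timestamp (seconds)
        output_tokens: int    # expected decode length

    class FCFSScheduler:
        """Baseline policy: serve requests in arrival order."""
        def pick(self, queue: List[Request]) -> Request:
            return min(queue, key=lambda r: r.arrival)

    class ShortestJobFirstScheduler:
        """Alternative policy: serve the shortest expected job first."""
        def pick(self, queue: List[Request]) -> Request:
            return min(queue, key=lambda r: r.output_tokens)

A loop like the one sketched under "How It Works" would call scheduler.pick(queue) whenever the replica frees up, which is what makes policy comparisons cheap to run.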

Maintenance & Community

  • Developed by Microsoft.
  • Contributions are welcome via pull requests, subject to a Contributor License Agreement (CLA).
  • Follows the Microsoft Open Source Code of Conduct.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: Permissive license allows for commercial use and integration with closed-source projects.

Limitations & Caveats

The simulator's accuracy is dependent on the fidelity of its execution time predictor, which may require initial profiling on target hardware. Support for specific hardware configurations (e.g., H100, 8xA40) is not universal across all models.
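
The predictor's role can be pictured with a toy example (purely illustrative; the actual predictor design is described in the MLSys'24 paper): profile an operator at a few batch sizes on the target GPU once, then answer later queries by interpolation. Any error in that table propagates directly into every simulated latency.

    import bisect

    class ExecTimePredictor:
        """Toy latency model: linear interpolation over profiled points."""
        def __init__(self, profiled):
            # profiled: sorted (batch_size, seconds) pairs from a one-off
            # profiling run on the target hardware.
            self.xs = [b for b, _ in profiled]
            self.ys = [s for _, s in profiled]

        def predict(self, batch_size):
            i = bisect.bisect_left(self.xs, batch_size)
            if i == 0:
                return self.ys[0]
            if i == len(self.xs):
                return self.ys[-1]
            x0, x1 = self.xs[i - 1], self.xs[i]
            y0, y1 = self.ys[i - 1], self.ys[i]
            return y0 + (y1 - y0) * (batch_size - x0) / (x1 - x0)

    # Hypothetical A100 measurements for one operator.
    pred = ExecTimePredictor([(1, 0.010), (8, 0.022), (32, 0.065), (128, 0.240)])
    print(f"{pred.predict(16):.4f}s")  # estimate for an unprofiled batch size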

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 5
  • Issues (30d): 12
  • Star history: 45 stars in the last 90 days
