Vidur by Microsoft

LLM inference system simulator

Created 1 year ago
442 stars

Top 67.8% on SourcePulse

Project Summary

Vidur is a high-fidelity LLM inference system simulator designed for researchers and engineers. It enables detailed performance analysis, capacity planning, and rapid prototyping of new scheduling algorithms and optimizations without requiring direct GPU access for most testing.

How It Works

Vidur simulates LLM inference by modeling request arrival, scheduling, execution, and resource utilization. It supports various workload traces and synthetic request generation, allowing users to evaluate metrics like Time To First Token (TTFT) and Total Request Time. The simulator's extensibility allows for the integration of novel scheduling algorithms and optimization techniques, such as speculative decoding, offering a flexible platform for system-level LLM research.
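To make the idea concrete, here is a minimal, purely illustrative discrete-event sketch of the kind of loop such a simulator runs: requests arrive over time, a first-come-first-served scheduler serves them, and per-request TTFT and total time are recorded. All names and timing constants below are illustrative assumptions, not Vidur's actual API or its fidelity model.

```python
import random

# Hypothetical sketch of an LLM inference simulator's core loop.
# Constants (prefill_time, decode_time_per_token, etc.) are made up
# for illustration; a real simulator like Vidur predicts these from
# profiled execution-time models.
def simulate(num_requests=100, arrival_rate=5.0, prefill_time=0.2,
             decode_time_per_token=0.02, tokens_per_request=64, seed=0):
    rng = random.Random(seed)

    # Poisson arrivals: exponential inter-arrival gaps.
    t, arrivals = 0.0, []
    for _ in range(num_requests):
        t += rng.expovariate(arrival_rate)
        arrivals.append(t)

    server_free_at = 0.0
    ttfts, totals = [], []
    for arrive in arrivals:                  # FCFS, one request at a time
        start = max(arrive, server_free_at)  # queue if the server is busy
        first_token = start + prefill_time   # TTFT includes queueing delay
        finish = first_token + decode_time_per_token * tokens_per_request
        server_free_at = finish
        ttfts.append(first_token - arrive)
        totals.append(finish - arrive)
    return sum(ttfts) / len(ttfts), sum(totals) / len(totals)

mean_ttft, mean_total = simulate()
print(f"mean TTFT: {mean_ttft:.3f}s  mean total: {mean_total:.3f}s")
```

Even this toy version shows why simulation is useful: swapping the FCFS loop for a different scheduling policy changes the TTFT distribution without touching any GPU.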

Quick Start & Requirements

  • Install: Create a mamba environment using mamba env create -p ./env -f ./environment.yml or a venv environment with python -m pip install -r requirements.txt.
  • Prerequisites: Python 3.10+ recommended. Optional wandb integration for logging.
  • Resources: Requires significant disk space for traces and simulation outputs. GPU access is only needed for initial profiling.
  • Docs: MLSys'24 paper and talk
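The two install paths listed above can be run as follows (the environment-creation commands are quoted from the repo's instructions; the venv creation and activation lines are the standard steps, added here for completeness):

```shell
# Option 1: mamba environment from the repo's environment.yml
mamba env create -p ./env -f ./environment.yml

# Option 2: plain venv + pip (venv setup lines are standard, not repo-specific)
python -m venv .venv
source .venv/bin/activate
python -m pip install -r requirements.txt
```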

Highlighted Details

  • Supports popular models like Llama-3, Llama-2, CodeLlama, InternLM, and Qwen.
  • Models tensor and pipeline parallelism configurations across various NVIDIA GPU architectures (A100, H100).
  • Outputs detailed simulation metrics and Chrome traces for in-depth analysis.
  • Extensible architecture for adding new models, SKUs, and scheduling algorithms.

Maintenance & Community

  • Developed by Microsoft.
  • Contributions are welcome via pull requests, subject to a Contributor License Agreement (CLA).
  • Follows the Microsoft Open Source Code of Conduct.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: Permissive license allows for commercial use and integration with closed-source projects.

Limitations & Caveats

The simulator's accuracy is dependent on the fidelity of its execution time predictor, which may require initial profiling on target hardware. Support for specific hardware configurations (e.g., H100, 8xA40) is not universal across all models.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 20 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 9 more.

LightLLM by ModelTC

  • Top 0.5%, 4k stars
  • Python framework for LLM inference and serving
  • Created 2 years ago; updated 14 hours ago
  • Starred by Lianmin Zheng (Coauthor of SGLang, vLLM), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 1 more.

MiniCPM by OpenBMB

  • Top 0.4%, 8k stars
  • Ultra-efficient LLMs for end devices, achieving 5x+ speedup
  • Created 1 year ago; updated 1 week ago