vidur by Microsoft

LLM inference system simulator

created 1 year ago
413 stars

Top 71.9% on sourcepulse

Project Summary

Vidur is a high-fidelity LLM inference system simulator designed for researchers and engineers. It enables detailed performance analysis, capacity planning, and rapid prototyping of new scheduling algorithms and optimizations without requiring direct GPU access for most testing.

How It Works

Vidur simulates LLM inference by modeling request arrival, scheduling, execution, and resource utilization. It supports both recorded workload traces and synthetic request generation, and reports metrics such as Time To First Token (TTFT) and Total Request Time. Its extensible design lets new scheduling algorithms and optimization techniques, such as speculative decoding, be integrated, making it a flexible platform for system-level LLM research.
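
As a concrete picture of this style of simulation, here is a minimal, self-contained sketch (all names are hypothetical; this is not Vidur's actual API): synthetic Poisson arrivals are scheduled FCFS onto a single replica with fixed prefill and per-token decode costs, and the run reports mean and p99 TTFT alongside mean Total Request Time.

    import random

    def simulate(num_requests=100, arrival_rate=0.5, prefill_s=0.5,
                 decode_s=0.02, output_tokens=64, seed=0):
        """FCFS simulation of one replica serving LLM requests."""
        rng = random.Random(seed)
        # Synthetic workload: Poisson arrivals via exponential gaps.
        arrivals, t = [], 0.0
        for _ in range(num_requests):
            t += rng.expovariate(arrival_rate)
            arrivals.append(t)

        free_at = 0.0  # time at which the replica next becomes idle
        ttfts, totals = [], []
        for arrival in arrivals:
            start = max(arrival, free_at)        # queue if replica is busy
            first_token = start + prefill_s      # prefill emits token 1
            finish = first_token + decode_s * (output_tokens - 1)
            free_at = finish
            ttfts.append(first_token - arrival)  # Time To First Token
            totals.append(finish - arrival)      # Total Request Time

        ttfts.sort()
        p99 = ttfts[int(0.99 * (len(ttfts) - 1))]
        print(f"mean TTFT {sum(ttfts)/len(ttfts):.3f}s | p99 TTFT {p99:.3f}s"
              f" | mean total {sum(totals)/len(totals):.3f}s")

    simulate()

A real simulator replaces the fixed prefill and decode constants with a learned execution time predictor and the single replica with a full cluster model, but the event loop structure is the same.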

Quick Start & Requirements

  • Install: Create a mamba environment using mamba env create -p ./env -f ./environment.yml or a venv environment with python -m pip install -r requirements.txt.
  • Prerequisites: Python 3.10+ recommended. Optional wandb integration for logging.
  • Resources: Requires significant disk space for traces and simulation outputs. GPU access is only needed for initial profiling.
  • Docs: MLSys'24 paper and talk
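  • Run: per the project README, python -m vidur.main launches a simulation from the repository root; python -m vidur.main -h lists the available configuration options (exact flags may vary across versions).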

Highlighted Details

  • Supports popular models like Llama-3, Llama-2, CodeLlama, InternLM, and Qwen.
  • Models tensor and pipeline parallelism configurations across various NVIDIA GPU architectures (A100, H100).
  • Outputs detailed simulation metrics and Chrome traces for in-depth analysis.
  • Extensible architecture for adding new models, SKUs, and scheduling algorithms (see the sketch after this list).
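
To illustrate the kind of extension point the last bullet refers to, here is a hedged sketch under an invented interface (Vidur's real scheduler classes live in its source tree and will differ): a policy only has to decide which queued request runs next, so replacing FCFS with shortest-job-first is a single small class.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Request:
        arrival: float        # arrival timestamp (seconds)
        output_tokens: int    # expected decode length

    class FCFSScheduler:
        """Baseline policy: serve requests in arrival order."""
        def pick(self, queue: List[Request]) -> Request:
            return min(queue, key=lambda r: r.arrival)

    class ShortestJobFirstScheduler:
        """Alternative policy: serve the shortest expected job first."""
        def pick(self, queue: List[Request]) -> Request:
            return min(queue, key=lambda r: r.output_tokens)

A loop like the one sketched under "How It Works" would call scheduler.pick(queue) whenever the replica frees up, which is what makes policy comparisons cheap to run.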

Maintenance & Community

  • Developed by Microsoft.
  • Contributions are welcome via pull requests, subject to a Contributor License Agreement (CLA).
  • Follows the Microsoft Open Source Code of Conduct.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: Permissive license allows for commercial use and integration with closed-source projects.

Limitations & Caveats

The simulator's accuracy is dependent on the fidelity of its execution time predictor, which may require initial profiling on target hardware. Support for specific hardware configurations (e.g., H100, 8xA40) is not universal across all models.
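
The predictor's role can be pictured with a toy example (purely illustrative; the actual predictor design is described in the MLSys'24 paper): profile an operator at a few batch sizes on the target GPU once, then answer later queries by interpolation. Any error in that table propagates directly into every simulated latency.

    import bisect

    class ExecTimePredictor:
        """Toy latency model: linear interpolation over profiled points."""
        def __init__(self, profiled):
            # profiled: sorted (batch_size, seconds) pairs from a one-off
            # profiling run on the target hardware.
            self.xs = [b for b, _ in profiled]
            self.ys = [s for _, s in profiled]

        def predict(self, batch_size):
            i = bisect.bisect_left(self.xs, batch_size)
            if i == 0:
                return self.ys[0]
            if i == len(self.xs):
                return self.ys[-1]
            x0, x1 = self.xs[i - 1], self.xs[i]
            y0, y1 = self.ys[i - 1], self.ys[i]
            return y0 + (y1 - y0) * (batch_size - x0) / (x1 - x0)

    # Hypothetical A100 measurements for one operator.
    pred = ExecTimePredictor([(1, 0.010), (8, 0.022), (32, 0.065), (128, 0.240)])
    print(f"{pred.predict(16):.4f}s")  # estimate for an unprofiled batch size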

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 5
  • Issues (30d): 12
  • Star history: 45 stars in the last 90 days
