MinivLLM by Wenyueh

Custom LLM inference engine with optimized attention mechanisms

Created 1 month ago
494 stars

Top 62.6% on SourcePulse

Project Summary

MinivLLM is a self-contained, custom implementation of the vLLM inference engine that prioritizes transparent, efficient versions of paged attention and flash attention. It targets engineers, researchers, and power users who want to understand, benchmark, or experiment with the core mechanisms behind high-performance Large Language Model (LLM) inference, offering a foundation for learning and customization.

How It Works

The project reimplements key components of vLLM, notably paged attention and flash attention, using Triton for GPU kernel development. Paged attention improves decoding efficiency with a virtual-memory-style paging system for KV-cache management, reducing fragmentation and improving memory utilization during token generation. Flash attention, implemented as custom Triton kernels, targets the compute-heavy prefilling phase: by processing attention in blocks, it cuts memory requirements from the O(N²) of standard attention to O(N), avoiding the limitations of plain PyTorch or naive Triton implementations on long sequences.
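The blocked, online-softmax computation that gives flash attention its O(N) memory footprint can be sketched in plain NumPy. This is an illustrative sketch only; the project itself implements this in Triton kernels, and none of the names below come from its codebase.

```python
import numpy as np

def naive_attention(Q, K, V):
    # Standard attention: materializes the full N x N score matrix -> O(N^2) memory.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def blocked_attention(Q, K, V, block=16):
    # Flash-attention-style streaming: visit K/V in tiles, keeping only a
    # running row-max (m), softmax normalizer (l), and output accumulator (O)
    # per query row. Peak extra memory is O(N * block), not O(N^2).
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full((N, 1), -np.inf)   # running max of scores seen so far
    l = np.zeros((N, 1))           # running softmax denominator
    O = np.zeros((N, d))           # unnormalized output accumulator
    for start in range(0, N, block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = (Q @ Kb.T) * scale                 # N x block scores for this tile
        m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
        correction = np.exp(m - m_new)         # rescale earlier accumulators
        P = np.exp(S - m_new)
        l = l * correction + P.sum(axis=-1, keepdims=True)
        O = O * correction + P @ Vb
        m = m_new
    return O / l

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((64, 32)) for _ in range(3))
```

Because the running max and normalizer are corrected as each tile arrives, the streamed result matches the full-matrix softmax exactly (up to floating-point error), which is why the benchmark comparisons in this project are apples-to-apples.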

Quick Start & Requirements

Installation uses the uv package manager: install uv (curl -LsSf https://astral.sh/uv/install.sh | sh), then synchronize dependencies (uv sync). Run the inference engine demo with uv run python main.py. Benchmarking scripts for prefilling and decoding run via uv run python benchmark_prefilling.py and uv run python benchmark_decoding.py, respectively. Prerequisites are a CUDA-capable GPU and Python 3.11 (3.12 is not supported). Multi-GPU configurations can be enabled by changing the world_size parameter in main.py's configuration.

Highlighted Details

  • Benchmarks comparing attention implementations in the prefilling phase: standard PyTorch (O(N²)), naive Triton (O(N²)), and memory-efficient flash attention (O(N)).
  • Benchmarks for the decoding phase covering naive PyTorch, optimized PyTorch with batch gathering, and a custom Triton kernel for paged-attention decode.
  • A complete inference pipeline demo (main.py) showing efficient batched text generation with the custom paged attention and KV-cache management.
  • Dedicated modules for models, engine logic (sequence definition, block management, scheduler, runner), layers, and utilities, making the inference stack easy to follow.
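The block-management idea behind paged attention can be sketched in a few lines of Python. This is a minimal sketch under assumed names (BlockManager, append_token, and so on are not the project's actual API): the KV cache is carved into fixed-size physical blocks, and each sequence holds a block table mapping logical token positions to physical blocks.

```python
BLOCK_SIZE = 4  # tokens per physical KV-cache block (illustrative value)

class BlockManager:
    """Toy paged-KV-cache bookkeeping: block tables instead of contiguous buffers."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # pool of physical block ids
        self.tables = {}                      # seq_id -> list of block ids

    def append_token(self, seq_id, length_after_append):
        # Allocate a new physical block only when the sequence crosses a
        # block boundary; logical position p lives at (table[p // BLOCK_SIZE],
        # p % BLOCK_SIZE), so blocks need not be contiguous in memory.
        table = self.tables.setdefault(seq_id, [])
        needed = -(-length_after_append // BLOCK_SIZE)  # ceil division
        while len(table) < needed:
            table.append(self.free.pop())
        return table

    def free_sequence(self, seq_id):
        # Finished sequences return their blocks to the pool, so memory is
        # reclaimed at block granularity with no fragmentation.
        self.free.extend(self.tables.pop(seq_id, []))

mgr = BlockManager(num_blocks=8)
for t in range(1, 7):              # generate 6 tokens for sequence 0
    table = mgr.append_token(0, t)
print(table)                       # 6 tokens at BLOCK_SIZE=4 -> 2 physical blocks
```

Because any free block can back any logical position, short and long sequences share one pool, which is the memory-utilization win the summary above describes.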

Maintenance & Community

No specific details regarding notable contributors, sponsorships, community support channels (such as Discord or Slack), or project roadmaps were present in the provided README excerpt.

Licensing & Compatibility

The repository's license type and any compatibility notes relevant to commercial use or integration with closed-source projects are not stated in the provided README excerpt.

Limitations & Caveats

MinivLLM is presented as a "simple replication" focused on core attention mechanisms, so it may not offer the full feature set or production-grade robustness of the original vLLM project. The strict Python requirement (>=3.11, <3.12) could pose compatibility challenges in more recent or diverse development environments.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 10
  • Issues (30d): 6
  • Star History: 237 stars in the last 30 days

Explore Similar Projects

Starred by Mehdi Amini (Author of MLIR; Distinguished Engineer at NVIDIA), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 15 more.

flashinfer by flashinfer-ai

Top 0.8% · 5k stars
Kernel library for LLM serving
Created 2 years ago · Updated 23 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

Top 0.3% · 22k stars
Fast, memory-efficient attention implementation
Created 3 years ago · Updated 17 hours ago