MinivLLM by Wenyueh

Custom LLM inference engine with optimized attention mechanisms

Created 1 month ago
494 stars

Top 62.6% on SourcePulse

Project Summary

MinivLLM is a self-contained, custom implementation of the vLLM inference engine that prioritizes transparent, efficient versions of paged attention and flash attention. It targets engineers, researchers, and power users who want to understand, benchmark, or experiment with the core mechanisms behind high-performance Large Language Model (LLM) inference, offering a foundation for learning and customization.

How It Works

The project reimplements key components of vLLM, notably paged attention and flash attention, using Triton for GPU kernel development. Paged attention improves decoding efficiency with a virtual-memory-style paging system for KV-cache management, reducing fragmentation and improving memory utilization during token generation. Flash attention, implemented as custom Triton kernels, targets the compute-heavy prefilling phase: by processing attention in blocks, it cuts memory requirements from the O(N²) of standard attention to O(N), avoiding the limitations of plain PyTorch or naive Triton implementations on long sequences.
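The blocked, online-softmax computation that gives flash attention its O(N) memory footprint can be sketched in plain NumPy. This is an illustrative sketch only; the project itself implements this in Triton kernels, and none of the names below come from its codebase.

```python
import numpy as np

def naive_attention(Q, K, V):
    # Standard attention: materializes the full N x N score matrix -> O(N^2) memory.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def blocked_attention(Q, K, V, block=16):
    # Flash-attention-style streaming: visit K/V in tiles, keeping only a
    # running row-max (m), softmax normalizer (l), and output accumulator (O)
    # per query row. Peak extra memory is O(N * block), not O(N^2).
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full((N, 1), -np.inf)   # running max of scores seen so far
    l = np.zeros((N, 1))           # running softmax denominator
    O = np.zeros((N, d))           # unnormalized output accumulator
    for start in range(0, N, block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = (Q @ Kb.T) * scale                 # N x block scores for this tile
        m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
        correction = np.exp(m - m_new)         # rescale earlier accumulators
        P = np.exp(S - m_new)
        l = l * correction + P.sum(axis=-1, keepdims=True)
        O = O * correction + P @ Vb
        m = m_new
    return O / l

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((64, 32)) for _ in range(3))
```

Because the running max and normalizer are corrected as each tile arrives, the streamed result matches the full-matrix softmax exactly (up to floating-point error), which is why the benchmark comparisons in this project are apples-to-apples.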

Quick Start & Requirements

Installation uses the uv package manager: install uv (curl -LsSf https://astral.sh/uv/install.sh | sh), then synchronize dependencies (uv sync). Run the inference engine demo with uv run python main.py. Benchmarking scripts for prefilling and decoding run via uv run python benchmark_prefilling.py and uv run python benchmark_decoding.py, respectively. Prerequisites are a CUDA-capable GPU and Python 3.11 (3.12 is not supported). Multi-GPU configurations can be enabled by changing the world_size parameter in main.py's configuration.

Highlighted Details

  • Benchmarks comparing attention implementations in the prefilling phase: standard PyTorch (O(N²)), naive Triton (O(N²)), and memory-efficient flash attention (O(N)).
  • Benchmarks for the decoding phase covering naive PyTorch, optimized PyTorch with batch gathering, and a custom Triton kernel for paged-attention decode.
  • A complete inference pipeline demo (main.py) showing efficient batched text generation with the custom paged attention and KV-cache management.
  • Dedicated modules for models, engine logic (sequence definition, block management, scheduler, runner), layers, and utilities, making the inference stack easy to follow.
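The block-management idea behind paged attention can be sketched in a few lines of Python. This is a minimal sketch under assumed names (BlockManager, append_token, and so on are not the project's actual API): the KV cache is carved into fixed-size physical blocks, and each sequence holds a block table mapping logical token positions to physical blocks.

```python
BLOCK_SIZE = 4  # tokens per physical KV-cache block (illustrative value)

class BlockManager:
    """Toy paged-KV-cache bookkeeping: block tables instead of contiguous buffers."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # pool of physical block ids
        self.tables = {}                      # seq_id -> list of block ids

    def append_token(self, seq_id, length_after_append):
        # Allocate a new physical block only when the sequence crosses a
        # block boundary; logical position p lives at (table[p // BLOCK_SIZE],
        # p % BLOCK_SIZE), so blocks need not be contiguous in memory.
        table = self.tables.setdefault(seq_id, [])
        needed = -(-length_after_append // BLOCK_SIZE)  # ceil division
        while len(table) < needed:
            table.append(self.free.pop())
        return table

    def free_sequence(self, seq_id):
        # Finished sequences return their blocks to the pool, so memory is
        # reclaimed at block granularity with no fragmentation.
        self.free.extend(self.tables.pop(seq_id, []))

mgr = BlockManager(num_blocks=8)
for t in range(1, 7):              # generate 6 tokens for sequence 0
    table = mgr.append_token(0, t)
print(table)                       # 6 tokens at BLOCK_SIZE=4 -> 2 physical blocks
```

Because any free block can back any logical position, short and long sequences share one pool, which is the memory-utilization win the summary above describes.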

Maintenance & Community

No specific details regarding notable contributors, sponsorships, community support channels (such as Discord or Slack), or project roadmaps were present in the provided README excerpt.

Licensing & Compatibility

The repository's license type and any compatibility notes relevant to commercial use or integration with closed-source projects are not stated in the provided README excerpt.

Limitations & Caveats

MinivLLM is presented as a "simple replication" focused on core attention mechanisms, so it may not offer the full feature set or production-grade robustness of the original vLLM project. The strict Python requirement (>=3.11, <3.12) could pose compatibility challenges in more recent or diverse development environments.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 10
  • Issues (30d): 6
  • Star History: 237 stars in the last 30 days

Explore Similar Projects

Starred by Mehdi Amini (Author of MLIR; Distinguished Engineer at NVIDIA), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 15 more.

flashinfer by flashinfer-ai

Top 0.8% · 5k stars
Kernel library for LLM serving
Created 2 years ago · Updated 23 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

Top 0.3% · 22k stars
Fast, memory-efficient attention implementation
Created 3 years ago · Updated 17 hours ago