vattention by microsoft

Memory manager for LLM serving systems

created 1 year ago
405 stars

Top 72.8% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

vAttention offers a novel memory management system for Large Language Model (LLM) serving, specifically targeting the KV-cache. It decouples virtual and physical memory allocation using CUDA virtual memory APIs, enabling on-demand physical memory allocation while maintaining KV-cache contiguity in virtual memory. This approach allows dynamic memory allocation for unmodified attention kernels, differing from PagedAttention which requires custom kernel rewrites. vAttention claims performance improvements, particularly for prefill-bound workloads.

How It Works

vAttention manages KV-cache memory with CUDA's low-level virtual memory APIs. It reserves virtual address space for the entire KV-cache upfront, creating contiguous virtual tensors, and then allocates physical memory on demand as requests are processed, mapping it into those virtual tensors. This contrasts with PagedAttention's user-space paging, which requires rewriting attention kernels to handle non-contiguous KV-cache blocks. The system supports both asynchronous memory allocation (overlapping allocation with compute) and synchronous allocation.
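
The core mechanism can be illustrated with the CUDA driver's virtual memory management calls (cuMemAddressReserve, cuMemCreate, cuMemMap, cuMemSetAccess). The following is a minimal sketch, not code from the vAttention repository: it assumes a single KV-cache tensor, an illustrative upper bound of 64 pages, and maps four pages to stand in for on-demand growth.

    // Minimal sketch of the reserve-then-map-on-demand pattern (not project code).
    #include <cuda.h>
    #include <cstdio>
    #include <cstdlib>

    #define CHECK(call)                                                 \
        do {                                                            \
            CUresult rc_ = (call);                                      \
            if (rc_ != CUDA_SUCCESS) {                                  \
                fprintf(stderr, "CUDA error %d at line %d\n",           \
                        (int)rc_, __LINE__);                            \
                exit(1);                                                \
            }                                                           \
        } while (0)

    int main() {
        CHECK(cuInit(0));
        CUdevice dev;
        CHECK(cuDeviceGet(&dev, 0));
        CUcontext ctx;
        CHECK(cuCtxCreate(&ctx, 0, dev));

        // Physical memory must be allocated at the driver's granularity
        // (typically 2MB with the stock driver).
        CUmemAllocationProp prop = {};
        prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
        prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
        prop.location.id = dev;
        size_t page = 0;
        CHECK(cuMemGetAllocationGranularity(&page, &prop,
                                            CU_MEM_ALLOC_GRANULARITY_MINIMUM));

        // 1. Reserve virtual address space for the whole KV-cache tensor upfront.
        //    Nothing is physically backed yet; the range is just a contiguous
        //    span of virtual addresses that attention kernels can index directly.
        size_t max_kv_bytes = 64 * page;        // illustrative upper bound
        CUdeviceptr kv_base = 0;
        CHECK(cuMemAddressReserve(&kv_base, max_kv_bytes, 0, 0, 0));

        // 2. As requests generate more KV-cache, attach physical pages one at a
        //    time to the front of the reserved range.
        CUmemAccessDesc access = {};
        access.location = prop.location;
        access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;

        size_t mapped = 0;
        while (mapped < 4 * page) {             // pretend 4 pages are needed so far
            CUmemGenericAllocationHandle handle;
            CHECK(cuMemCreate(&handle, page, &prop, 0));
            CHECK(cuMemMap(kv_base + mapped, page, 0, handle, 0));
            CHECK(cuMemSetAccess(kv_base + mapped, page, &access, 1));
            mapped += page;
        }
        printf("reserved %zu bytes, physically mapped %zu bytes\n",
               max_kv_bytes, mapped);

        // Cleanup (cuMemUnmap, cuMemRelease, cuMemAddressFree) omitted for brevity.
        return 0;
    }

Because the reserved virtual range never changes, the tensor handed to an unmodified attention kernel stays contiguous even though physical pages are attached incrementally.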

Quick Start & Requirements

  • Installation: Requires PyTorch 2.3.0 and CUDA 12.1+. Tested with Linux, A100 GPUs, and Python 3.10.
  • Setup:
    1. Create and activate a conda environment (conda create -n vattn python=3.10, conda activate vattn).
    2. Download and extract libtorch-shared-with-deps-2.3.0+cu121.zip.
    3. Build Sarathi-Serve: cd sarathi-lean/ && pip install -e . --extra-index-url https://flashinfer.ai/whl/cu121/torch2.3/.
    4. Build vAttention: cd ../vattention/ && LIBTORCH_PATH=<path_to_libtorch> python setup.py install.
  • Resources: Requires downloading and extracting libtorch.
  • Documentation: the Sarathi-Serve README and the vAttention paper.

Highlighted Details

  • Implements dynamic memory management for LLM KV-caches without modifying attention kernels.
  • Offers an alternative to PagedAttention, potentially improving performance in prefill-bound scenarios.
  • Integrates with Sarathi-Serve, providing benchmark scripts for dynamic and static workloads.
  • Supports FlashAttention and FlashInfer backends with configurable page sizes (64KB to 2MB).
  • Includes an OpenAI-compatible API for benchmarking with tools like Metron.

Maintenance & Community

  • This repository is a research prototype, originally forked from Sarathi-Serve, which itself is a fork of vLLM.
  • Feature parity with open-source vLLM is not complete.
  • Citation details are provided for academic use.

Licensing & Compatibility

  • The specific license is not explicitly stated in the README, but the project is associated with Microsoft. Compatibility for commercial use or closed-source linking would require clarification.

Limitations & Caveats

  • Requires replacing the default CUDA UVM drivers for page sizes smaller than 2MB (see the granularity note after this list).
  • The project is described as a research prototype with incomplete feature parity compared to vLLM.
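
For context on the 2MB figure above: the minimum physical allocation granularity reported by the stock CUDA driver is typically 2MB, which is why smaller KV-cache page sizes need the modified driver shipped with the project. A minimal query, again an illustrative sketch rather than project code:

    #include <cuda.h>
    #include <cstdio>

    int main() {
        // Error checking omitted for brevity.
        cuInit(0);
        CUdevice dev;
        cuDeviceGet(&dev, 0);
        CUcontext ctx;
        cuCtxCreate(&ctx, 0, dev);

        // Describe a pinned device allocation on GPU 0 and ask the driver
        // for the smallest granularity it will accept for cuMemCreate.
        CUmemAllocationProp prop = {};
        prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
        prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
        prop.location.id = dev;

        size_t gran = 0;
        cuMemGetAllocationGranularity(&gran, &prop,
                                      CU_MEM_ALLOC_GRANULARITY_MINIMUM);
        printf("minimum allocation granularity: %zu bytes\n", gran);  // typically 2097152
        return 0;
    }
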
Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 44 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Tobi Lutke (Cofounder of Shopify), and 27 more.

vllm by vllm-project

LLM serving engine for high-throughput, memory-efficient inference

  • Top 1.0% on sourcepulse
  • 54k stars
  • created 2 years ago
  • updated 10 hours ago