vattention by microsoft

Memory manager for LLM serving systems

created 1 year ago
405 stars

Top 72.8% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

vAttention offers a novel memory management system for Large Language Model (LLM) serving, specifically targeting the KV-cache. It decouples virtual and physical memory allocation using CUDA virtual memory APIs, enabling on-demand physical memory allocation while maintaining KV-cache contiguity in virtual memory. This approach allows dynamic memory allocation for unmodified attention kernels, differing from PagedAttention which requires custom kernel rewrites. vAttention claims performance improvements, particularly for prefill-bound workloads.

How It Works

vAttention manages KV-cache memory with CUDA's low-level virtual memory APIs. It reserves virtual address space for the entire KV-cache upfront, creating contiguous virtual tensors, and then allocates physical memory on demand as requests are processed, mapping it into those virtual tensors. This contrasts with PagedAttention's user-space paging, which requires rewriting attention kernels to handle non-contiguous KV-cache blocks. The system supports both asynchronous memory allocation (overlapping allocation with compute) and synchronous allocation.
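
The core mechanism can be illustrated with the CUDA driver's virtual memory management calls (cuMemAddressReserve, cuMemCreate, cuMemMap, cuMemSetAccess). The following is a minimal sketch, not code from the vAttention repository: it assumes a single KV-cache tensor, an illustrative upper bound of 64 pages, and maps four pages to stand in for on-demand growth.

    // Minimal sketch of the reserve-then-map-on-demand pattern (not project code).
    #include <cuda.h>
    #include <cstdio>
    #include <cstdlib>

    #define CHECK(call)                                                 \
        do {                                                            \
            CUresult rc_ = (call);                                      \
            if (rc_ != CUDA_SUCCESS) {                                  \
                fprintf(stderr, "CUDA error %d at line %d\n",           \
                        (int)rc_, __LINE__);                            \
                exit(1);                                                \
            }                                                           \
        } while (0)

    int main() {
        CHECK(cuInit(0));
        CUdevice dev;
        CHECK(cuDeviceGet(&dev, 0));
        CUcontext ctx;
        CHECK(cuCtxCreate(&ctx, 0, dev));

        // Physical memory must be allocated at the driver's granularity
        // (typically 2MB with the stock driver).
        CUmemAllocationProp prop = {};
        prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
        prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
        prop.location.id = dev;
        size_t page = 0;
        CHECK(cuMemGetAllocationGranularity(&page, &prop,
                                            CU_MEM_ALLOC_GRANULARITY_MINIMUM));

        // 1. Reserve virtual address space for the whole KV-cache tensor upfront.
        //    Nothing is physically backed yet; the range is just a contiguous
        //    span of virtual addresses that attention kernels can index directly.
        size_t max_kv_bytes = 64 * page;        // illustrative upper bound
        CUdeviceptr kv_base = 0;
        CHECK(cuMemAddressReserve(&kv_base, max_kv_bytes, 0, 0, 0));

        // 2. As requests generate more KV-cache, attach physical pages one at a
        //    time to the front of the reserved range.
        CUmemAccessDesc access = {};
        access.location = prop.location;
        access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;

        size_t mapped = 0;
        while (mapped < 4 * page) {             // pretend 4 pages are needed so far
            CUmemGenericAllocationHandle handle;
            CHECK(cuMemCreate(&handle, page, &prop, 0));
            CHECK(cuMemMap(kv_base + mapped, page, 0, handle, 0));
            CHECK(cuMemSetAccess(kv_base + mapped, page, &access, 1));
            mapped += page;
        }
        printf("reserved %zu bytes, physically mapped %zu bytes\n",
               max_kv_bytes, mapped);

        // Cleanup (cuMemUnmap, cuMemRelease, cuMemAddressFree) omitted for brevity.
        return 0;
    }

Because the reserved virtual range never changes, the tensor handed to an unmodified attention kernel stays contiguous even though physical pages are attached incrementally.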

Quick Start & Requirements

  • Installation: Requires PyTorch 2.3.0 and CUDA 12.1+. Tested with Linux, A100 GPUs, and Python 3.10.
  • Setup:
    1. Create and activate a conda environment (conda create -n vattn python=3.10, conda activate vattn).
    2. Download and extract libtorch-shared-with-deps-2.3.0+cu121.zip.
    3. Build Sarathi-Serve: cd sarathi-lean/ && pip install -e . --extra-index-url https://flashinfer.ai/whl/cu121/torch2.3/.
    4. Build vAttention: cd ../vattention/ && LIBTORCH_PATH=<path_to_libtorch> python setup.py install.
  • Resources: Requires downloading and extracting libtorch.
  • Documentation: the Sarathi-Serve README and the vAttention paper.

Highlighted Details

  • Implements dynamic memory management for LLM KV-caches without modifying attention kernels.
  • Offers an alternative to PagedAttention, potentially improving performance in prefill-bound scenarios.
  • Integrates with Sarathi-Serve, providing benchmark scripts for dynamic and static workloads.
  • Supports FlashAttention and FlashInfer backends with configurable page sizes (64KB to 2MB).
  • Includes an OpenAI-compatible API for benchmarking with tools like Metron.

Maintenance & Community

  • This repository is a research prototype, originally forked from Sarathi-Serve, which itself is a fork of vLLM.
  • Feature parity with open-source vLLM is not complete.
  • Citation details are provided for academic use.

Licensing & Compatibility

  • The specific license is not explicitly stated in the README, but the project is associated with Microsoft. Compatibility for commercial use or closed-source linking would require clarification.

Limitations & Caveats

  • Requires replacing the default CUDA UVM drivers for page sizes smaller than 2MB (see the granularity note after this list).
  • The project is described as a research prototype with incomplete feature parity compared to vLLM.
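
For context on the 2MB figure above: the minimum physical allocation granularity reported by the stock CUDA driver is typically 2MB, which is why smaller KV-cache page sizes need the modified driver shipped with the project. A minimal query, again an illustrative sketch rather than project code:

    #include <cuda.h>
    #include <cstdio>

    int main() {
        // Error checking omitted for brevity.
        cuInit(0);
        CUdevice dev;
        cuDeviceGet(&dev, 0);
        CUcontext ctx;
        cuCtxCreate(&ctx, 0, dev);

        // Describe a pinned device allocation on GPU 0 and ask the driver
        // for the smallest granularity it will accept for cuMemCreate.
        CUmemAllocationProp prop = {};
        prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
        prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
        prop.location.id = dev;

        size_t gran = 0;
        cuMemGetAllocationGranularity(&gran, &prop,
                                      CU_MEM_ALLOC_GRANULARITY_MINIMUM);
        printf("minimum allocation granularity: %zu bytes\n", gran);  // typically 2097152
        return 0;
    }
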
Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 44 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Tobi Lutke (Cofounder of Shopify), and 27 more.

vllm by vllm-project

LLM serving engine for high-throughput, memory-efficient inference

  • Top 1.0% on sourcepulse
  • 54k stars
  • created 2 years ago
  • updated 10 hours ago