Memory manager for LLM serving systems
vAttention is a novel memory management system for Large Language Model (LLM) serving that targets the KV-cache. It decouples virtual and physical memory allocation using CUDA virtual memory APIs, allocating physical memory on demand while keeping the KV-cache contiguous in virtual memory. This lets unmodified attention kernels work with dynamically allocated memory, unlike PagedAttention, which requires custom kernel rewrites. vAttention claims performance improvements, particularly for prefill-bound workloads.
How It Works
vAttention leverages CUDA's low-level virtual memory management (VMM) APIs to manage KV-cache memory. It reserves virtual address space for the entire KV-cache upfront, creating contiguous virtual tensors, and then allocates and maps physical memory on demand as requests are processed. This contrasts with PagedAttention's user-space paging, which requires attention kernels to be modified. The system supports both synchronous and asynchronous physical memory allocation, the latter overlapping allocation with compute.
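The sketch below is not vAttention's implementation; it is a minimal illustration of the CUDA driver-API pattern described above (cuMemAddressReserve, cuMemCreate, cuMemMap, cuMemSetAccess): reserve a large contiguous virtual range up front, then attach physical pages only when they are actually needed. Sizes and the error-check macro are illustrative.

// Minimal sketch of on-demand physical mapping over a contiguous virtual range.
// Build as host C++ and link against the CUDA driver library (-lcuda).
#include <cuda.h>
#include <cstdio>

#define CHECK(call) do { CUresult r = (call); if (r != CUDA_SUCCESS) { \
    const char *msg; cuGetErrorString(r, &msg); \
    std::printf("CUDA error: %s\n", msg); return 1; } } while (0)

int main() {
    CHECK(cuInit(0));
    CUdevice dev;  CHECK(cuDeviceGet(&dev, 0));
    CUcontext ctx; CHECK(cuCtxCreate(&ctx, 0, dev));

    // Physical allocations must be multiples of the device's granularity.
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;
    size_t gran = 0;
    CHECK(cuMemGetAllocationGranularity(&gran, &prop,
          CU_MEM_ALLOC_GRANULARITY_MINIMUM));

    // 1. Reserve a large contiguous *virtual* range (stands in for the KV-cache).
    size_t virt_size = 64 * gran;  // illustrative size
    CUdeviceptr base = 0;
    CHECK(cuMemAddressReserve(&base, virt_size, 0, 0, 0));

    // 2. On demand: allocate one physical chunk and map it into the range.
    CUmemGenericAllocationHandle handle;
    CHECK(cuMemCreate(&handle, gran, &prop, 0));
    CHECK(cuMemMap(base, gran, 0, handle, 0));

    // 3. Make the mapped region readable/writable from the device.
    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    CHECK(cuMemSetAccess(base, gran, &access, 1));

    // Kernels may now use [base, base + gran); the rest of the virtual range
    // stays unmapped until more physical memory is actually required.

    CHECK(cuMemUnmap(base, gran));
    CHECK(cuMemRelease(handle));
    CHECK(cuMemAddressFree(base, virt_size));
    CHECK(cuCtxDestroy(ctx));
    return 0;
}

Because the virtual range stays contiguous, an attention kernel can index into it as an ordinary tensor, which is why no kernel changes are needed.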
Quick Start & Requirements
conda create -n vattn python=3.10
conda activate vattn
Download and extract libtorch (libtorch-shared-with-deps-2.3.0+cu121.zip).
cd sarathi-lean/ && pip install -e . --extra-index-url https://flashinfer.ai/whl/cu121/torch2.3/
cd ../vattention/ && LIBTORCH_PATH=<path_to_libtorch> python setup.py install
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats