vattention by microsoft

Memory manager for LLM serving systems

Created 1 year ago
454 stars

Top 66.4% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

vAttention offers a novel memory management system for Large Language Model (LLM) serving, specifically targeting the KV-cache. It decouples virtual and physical memory allocation using CUDA virtual memory APIs, enabling on-demand physical memory allocation while keeping the KV-cache contiguous in virtual memory. This approach allows dynamic memory allocation with unmodified attention kernels, unlike PagedAttention, which requires custom kernel rewrites. vAttention claims performance improvements, particularly for prefill-bound workloads.

How It Works

vAttention uses CUDA's low-level virtual memory management (VMM) APIs to manage KV-cache memory. It reserves virtual address space for the entire KV-cache upfront, creating contiguous virtual tensors; physical memory is then allocated on demand as requests are processed and mapped into these virtual tensors. This contrasts with PagedAttention's user-space paging, which necessitates kernel modifications. The system supports both asynchronous memory allocation (overlapping allocation with compute) and synchronous allocation.
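For intuition, here is a minimal, self-contained sketch of the reserve-then-map-on-demand pattern using the CUDA driver VMM API. It is not vAttention's actual code; the reservation size and the single mapped page are illustrative, and in practice vAttention wraps this machinery behind PyTorch tensors. The sketch compiles with nvcc and links against the driver API (-lcuda).

    #include <cuda.h>
    #include <stdio.h>

    // Illustrative sketch: reserve a contiguous virtual range for a KV-cache,
    // then back it with physical pages on demand via the CUDA driver VMM API.
    #define CHECK(call) do { CUresult r_ = (call); if (r_ != CUDA_SUCCESS) { \
        const char *msg; cuGetErrorString(r_, &msg);                         \
        fprintf(stderr, "CUDA error: %s\n", msg); return 1; } } while (0)

    int main(void) {
        CHECK(cuInit(0));
        CUdevice dev;
        CHECK(cuDeviceGet(&dev, 0));
        CUcontext ctx;
        CHECK(cuCtxCreate(&ctx, 0, dev));

        // Physical pages will come from the local GPU.
        CUmemAllocationProp prop = {};
        prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
        prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
        prop.location.id = dev;

        // Allocation granularity: 2MB on current GPUs unless the UVM driver
        // is replaced to expose smaller pages (see Limitations below).
        size_t page;
        CHECK(cuMemGetAllocationGranularity(&page, &prop,
                                            CU_MEM_ALLOC_GRANULARITY_MINIMUM));

        // 1. Reserve virtual address space for the whole cache up front.
        //    No physical memory is consumed yet; the tensor stays contiguous.
        size_t reserve_bytes = 64 * page;  // illustrative size
        CUdeviceptr kv_base;
        CHECK(cuMemAddressReserve(&kv_base, reserve_bytes, 0, 0, 0));

        // 2. As a request's context grows, map one more physical page into
        //    the reserved range (this step can also run asynchronously,
        //    overlapped with the previous iteration's compute).
        CUmemGenericAllocationHandle handle;
        CHECK(cuMemCreate(&handle, page, &prop, 0));
        CHECK(cuMemMap(kv_base, page, 0, handle, 0));

        CUmemAccessDesc access = {};
        access.location = prop.location;
        access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
        CHECK(cuMemSetAccess(kv_base, page, &access, 1));

        // Kernels can now read/write kv_base[0 .. page) as ordinary memory.

        // 3. When the request finishes, unmap and release the physical page;
        //    the virtual reservation can be reused for another request.
        CHECK(cuMemUnmap(kv_base, page));
        CHECK(cuMemRelease(handle));
        CHECK(cuMemAddressFree(kv_base, reserve_bytes));
        CHECK(cuCtxDestroy(ctx));
        return 0;
    }

Because the reserved range stays contiguous in virtual memory, existing FlashAttention or FlashInfer kernels can index the cache directly, without the block-table indirection that PagedAttention requires.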

Quick Start & Requirements

  • Installation: Requires PyTorch 2.3.0 and CUDA 12.1+. Tested on Linux with A100 GPUs and Python 3.10.
  • Setup:
    1. Create and activate a conda environment (conda create -n vattn python=3.10, conda activate vattn).
    2. Download and extract libtorch-shared-with-deps-2.3.0+cu121.zip.
    3. Build Sarathi-Serve: cd sarathi-lean/ && pip install -e . --extra-index-url https://flashinfer.ai/whl/cu121/torch2.3/.
    4. Build vAttention: cd ../vattention/ && LIBTORCH_PATH=<path_to_libtorch> python setup.py install.
  • Resources: Requires downloading and extracting libtorch.
  • Documentation: Sarathi-Serve, paper.

Highlighted Details

  • Implements dynamic memory management for LLM KV-caches without modifying attention kernels.
  • Offers an alternative to PagedAttention, potentially improving performance in prefill-bound scenarios.
  • Integrates with Sarathi-Serve, providing benchmark scripts for dynamic and static workloads.
  • Supports FlashAttention and FlashInfer backends with configurable page sizes (64KB to 2MB).
  • Includes an OpenAI-compatible API for benchmarking with tools like Metron.

Maintenance & Community

  • This repository is a research prototype, originally forked from Sarathi-Serve, which itself is a fork of vLLM.
  • Feature parity with open-source vLLM is not complete.
  • Citation details are provided for academic use.

Licensing & Compatibility

  • The specific license is not explicitly stated in the README, but the project is associated with Microsoft. Compatibility for commercial use or closed-source linking would require clarification.

Limitations & Caveats

  • Requires replacing the default CUDA UVM driver to use page sizes smaller than 2MB.
  • The project is described as a research prototype with incomplete feature parity compared to vLLM.
Health Check

  • Last Commit: 7 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
Star History
7 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jeff Hammerbacher (Cofounder of Cloudera), and 14 more.

flashinfer by flashinfer-ai
Kernel library for LLM serving
5k stars · Created 2 years ago · Updated 14 hours ago

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; Author of CS 231n), Clement Delangue (Cofounder of Hugging Face), and 60 more.

vllm by vllm-project
LLM serving engine for high-throughput, memory-efficient inference
67k stars · Created 2 years ago · Updated 14 hours ago