FlexKV by taco-project

LLM inference acceleration via distributed KV cache management

Created 10 months ago
259 stars

Top 97.7% on SourcePulse

Summary

FlexKV manages the KV cache for distributed Large Language Model (LLM) inference. Developed by Tencent Cloud's TACO team, it provides a distributed KV store and a multi-level cache that aim to raise inference throughput and reduce latency by spreading cached data across tiered storage. It targets large-scale LLM deployments and integrates with vLLM, NVIDIA Dynamo, and TensorRT-LLM.

How It Works

FlexKV uses a three-tier caching hierarchy: CPU memory, local SSD, and scalable distributed storage. Offloading KVCache data to these tiers mitigates GPU VRAM limits. The architecture comprises three core modules:

  • StorageEngine: handles block-level KVCache storage, with optional block-wise merging to optimize I/O.
  • GlobalCacheEngine: acts as the control plane, using a RadixTree for prefix matching and managing cache eviction.
  • TransferEngine: executes data transfers asynchronously using mechanisms such as io_uring and GPU Direct Storage (GDS), so that computation and data movement can overlap.

Distributed KVCache reuse is provided by a distributed RadixTree and the RDMA-based Mooncake Transfer Engine.
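The prefix matching done by the control plane can be illustrated with a small sketch. This is not FlexKV's code: it uses a plain (uncompressed) trie keyed by chained block hashes rather than a true radix tree, `hashlib` rather than xxhash, and the names `PrefixIndex`, `block_keys`, and `BLOCK_SIZE` are hypothetical. Each node records which tier is assumed to hold that block.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List

BLOCK_SIZE = 4  # tokens per KV block; hypothetical value for this sketch


def block_keys(tokens: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Hash each full block chained with the hash of the preceding blocks."""
    keys: List[str] = []
    prev = ""
    full = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for i in range(0, full, block_size):
        chunk = ",".join(map(str, tokens[i:i + block_size]))
        prev = hashlib.sha256((prev + "|" + chunk).encode()).hexdigest()
        keys.append(prev)
    return keys


@dataclass
class Node:
    tier: str  # assumed tier labels: "cpu", "ssd", or "remote"
    children: Dict[str, "Node"] = field(default_factory=dict)


class PrefixIndex:
    """Block-granular prefix index (a trie, standing in for a radix tree)."""

    def __init__(self) -> None:
        self.root = Node(tier="")

    def insert(self, tokens: List[int], tier: str) -> None:
        node = self.root
        for key in block_keys(tokens):
            # Existing blocks keep their recorded tier; new ones get `tier`.
            node = node.children.setdefault(key, Node(tier=tier))

    def match_prefix(self, tokens: List[int]) -> List[str]:
        """Return the tiers of the longest run of cached leading blocks."""
        node, tiers = self.root, []
        for key in block_keys(tokens):
            child = node.children.get(key)
            if child is None:
                break
            tiers.append(child.tier)
            node = child
        return tiers


index = PrefixIndex()
index.insert([1, 2, 3, 4, 5, 6, 7, 8], tier="cpu")
index.insert([1, 2, 3, 4, 9, 9, 9, 9], tier="ssd")

hit = index.match_prefix([1, 2, 3, 4, 5, 6, 7, 8, 10, 11])
print(hit)  # ['cpu', 'cpu']
```

A request whose first two blocks match would only need to compute KV entries for the remaining tokens; the matched blocks would be fetched from the tiers listed in the result.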

Quick Start & Requirements

  • Dependencies: Requires liburing-dev, libxxhash-dev, and libhiredis-dev.
  • Build: Compile using ./build.sh or ./build.sh --release.
  • Integration:
    • vLLM: FlexKV is now a built-in component (FlexKVConnectorV1) in vLLM mainline (v0.17.2+). See docs/vllm_adapter/README_en.md.
    • NVIDIA Dynamo: Integrated as a native KV Cache Offloading option. See docs/dynamo_integration/README_en.md.
    • TensorRT-LLM: Support is available. See docs/trtllm_adaption/README_en.md.
  • API: As of v1.0.0, FlexKV is exposed as a directly callable library API.

Highlighted Details

  • Framework Integration: Integrated into vLLM, NVIDIA Dynamo, and TensorRT-LLM, with support for TP16 configurations.
  • Performance Enhancements: Uses GPU Direct Storage (GDS) for direct SSD-to-GPU transfers, and supports the RDMA-based Mooncake Transfer Engine for cross-node KVCache sharing.
  • Distributed Capabilities: Enables distributed KVCache reuse across nodes with a distributed RadixTree and lease mechanism.
  • Monitoring: Includes a zero-intrusion Prometheus-based runtime monitoring framework for cache hit/miss rates, memory status, and transfer statistics.
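The source does not describe how the lease mechanism works. As a generic illustration of the idea only, the sketch below assumes a time-based lease that protects a block from eviction while a remote node may still be reading it. `LeaseTable`, its methods, and the TTL value are hypothetical and are not FlexKV's API.

```python
import time
from typing import Callable, Dict, List


class LeaseTable:
    """Tracks time-limited leases on cached blocks (illustrative only)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the example is deterministic
        self._expiry: Dict[str, float] = {}

    def grant(self, block_id: str) -> float:
        """Grant or renew a lease; an existing lease is never shortened."""
        expires = self.clock() + self.ttl
        current = self._expiry.get(block_id, float("-inf"))
        self._expiry[block_id] = max(expires, current)
        return self._expiry[block_id]

    def is_leased(self, block_id: str) -> bool:
        return self._expiry.get(block_id, float("-inf")) > self.clock()

    def evictable(self, block_ids: List[str]) -> List[str]:
        """Blocks with no live lease may be evicted by the owning node."""
        return [b for b in block_ids if not self.is_leased(b)]


now = [100.0]  # fake clock, advanced by hand
leases = LeaseTable(ttl_seconds=5.0, clock=lambda: now[0])
leases.grant("block-a")

before = leases.evictable(["block-a", "block-b"])
now[0] += 6.0  # the lease on block-a has expired
after = leases.evictable(["block-a", "block-b"])
print(before, after)  # ['block-b'] ['block-a', 'block-b']
```

The design choice illustrated here is that eviction consults the lease table first, so a block advertised to another node is not dropped mid-transfer; an expired lease needs no explicit release message.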

Maintenance & Community

Developed by Tencent Cloud's TACO team in collaboration with the community. Its merger into frameworks such as vLLM and NVIDIA Dynamo indicates adoption and ongoing relevance. The main branch is used for rapid iteration, while release-* branches provide stable versions.

Licensing & Compatibility

Licensed under the permissive Apache-2.0 License, allowing for broad compatibility with commercial and closed-source applications.

Limitations & Caveats

The README does not explicitly list limitations. The project is under active development, and open roadmap items cover further distributed query support and latency optimization, so these areas are still evolving. Building requires system-level dependencies (liburing-dev, libxxhash-dev, and libhiredis-dev).

Health Check

  • Last Commit: 12 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 14
  • Issues (30d): 2
  • Star History: 24 stars in the last 30 days
