LvLLM by guqiong96

Efficient hybrid CPU-GPU inference for large language models

Created 5 months ago
258 stars

Top 98.0% on SourcePulse

Project Summary

LvLLM is a specialized extension of vLLM designed for efficient large-model inference, particularly for Mixture-of-Experts (MoE) models. It targets researchers and power users by enabling full utilization of CPU and GPU resources through a GPU + NUMA parallel architecture, significantly reducing GPU memory demands and accommodating larger models.

How It Works

LvLLM employs a "GPU + NUMA Dual Parallelism" strategy, supporting hybrid CPU-GPU decoding and prefill modes. Its "VRAM + Memory Load Balancing" allows the total model footprint to span both GPU VRAM and system RAM, effectively enabling the loading of models that would otherwise exceed GPU capacity. Additionally, "GPU Prefill Optimization" runs GPU prefill tasks in parallel with CPU-GPU hybrid decoding, aiming for near 100% GPU utilization. NUMA thread optimization further reduces cross-node communication and improves L3 cache hit rates.
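As a rough illustration of the NUMA side of this design, an inference server can be pinned to a single NUMA node with numactl so that CPU-resident expert weights are allocated and read on the same node. Only the environment variables LVLLM_MOE_NUMA_ENABLED and LK_THREADS are named in this summary; the thread count, launch command, and model path below are illustrative assumptions, and LvLLM's actual CLI may differ.

```shell
# Illustrative sketch only: pin the server to NUMA node 0 so CPU-side
# expert weights stay local to the cores that read them.
export LVLLM_MOE_NUMA_ENABLED=1   # env var named in this summary
export LK_THREADS=32              # thread count is an assumed placeholder
numactl --cpunodebind=0 --membind=0 \
    python -m vllm.entrypoints.openai.api_server --model <model-path>
```

Binding both CPU placement (--cpunodebind) and memory allocation (--membind) to the same node is what avoids the cross-node traffic the summary describes.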

Quick Start & Requirements

  • Primary Install: pip install -e . after cloning the repository and setting up the Python environment.
  • Prerequisites: CUDA 12.9, Python 3.12.11, libnuma-dev (Ubuntu) or numactl-devel (Rocky Linux). Requires x86 CPUs with AVX2 or newer instruction sets and NVIDIA GPUs. PyTorch 2.9.1 is specified.
  • Setup: Involves CUDA installation, environment setup (conda), dependency installation (requirements/build.txt), and compiling LvLLM. Specific environment variables (e.g., LVLLM_MOE_NUMA_ENABLED, LK_THREADS) are critical for configuration.
  • Links: Official quick-start examples are provided for various models (Qwen, MiniMax, Kimi, GLM).
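On Ubuntu, the setup flow described above might look roughly like the following. The repository URL is left as a placeholder rather than guessed; everything beyond the steps named in this summary (clone, conda environment, requirements/build.txt, editable install) is an assumption, so consult the project README for the authoritative steps.

```shell
# Sketch of the quick-start flow on Ubuntu; <lvllm-repo-url> is a placeholder.
git clone <lvllm-repo-url> lvllm && cd lvllm
sudo apt-get install -y libnuma-dev          # Rocky Linux: numactl-devel
conda create -n lvllm python=3.12 -y && conda activate lvllm
pip install -r requirements/build.txt        # build dependencies
pip install -e .                             # compile and install LvLLM
```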

Highlighted Details

  • Supports hybrid inference for MoE models using CPU-GPU parallelism.
  • Accommodates models larger than available VRAM by utilizing system memory.
  • Optimized GPU prefill runs concurrently with decoding for high GPU utilization.
  • Supports FP8 and AWQ 4-bit quantized model formats.
  • NUMA thread optimization minimizes cross-node communication and improves cache performance.

Maintenance & Community

No specific details regarding maintainers, community channels (like Discord/Slack), or roadmaps were found in the provided text.

Licensing & Compatibility

The repository's license is not explicitly stated in the provided README content. Verify the license before any commercial use or closed-source integration.

Limitations & Caveats

Requires specific hardware (AVX2+ CPUs, Nvidia GPUs) and CUDA 12.9. GGUF model support has been removed. Some models, like DeepSeek-V3.2, are listed as pending testing. For FP8/AWQ MoE models, max_num_batched_tokens is temporarily limited to 32000.

Health Check
Last Commit

2 days ago

Responsiveness

Inactive

Pull Requests (30d)
3
Issues (30d)
4
Star History
92 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Ying Sheng (coauthor of SGLang).

fastllm by ztxz16 — high-performance C++ LLM inference library
Top 0.2% · 4k stars · Created 2 years ago · Updated 1 day ago