LvLLM by guqiong96

Efficient hybrid CPU-GPU inference for large language models

Created 5 months ago
258 stars

Top 98.0% on SourcePulse

Project Summary

LvLLM is a specialized extension of vLLM designed for efficient large-model inference, particularly for Mixture-of-Experts (MoE) models. It targets researchers and power users by enabling full utilization of CPU and GPU resources through a GPU + NUMA parallel architecture, significantly reducing GPU memory demands and accommodating larger models.

How It Works

LvLLM employs a "GPU + NUMA Dual Parallelism" strategy, supporting hybrid CPU-GPU decoding and prefill modes. Its "VRAM + Memory Load Balancing" allows the total model footprint to span both GPU VRAM and system RAM, effectively enabling the loading of models that would otherwise exceed GPU capacity. Additionally, "GPU Prefill Optimization" runs GPU prefill tasks in parallel with CPU-GPU hybrid decoding, aiming for near 100% GPU utilization. NUMA thread optimization further reduces cross-node communication and improves L3 cache hit rates.
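As a rough illustration of the NUMA side of this design, an inference server can be pinned to a single NUMA node with numactl so that CPU-resident expert weights are allocated and read on the same node. Only the environment variables LVLLM_MOE_NUMA_ENABLED and LK_THREADS are named in this summary; the thread count, launch command, and model path below are illustrative assumptions, and LvLLM's actual CLI may differ.

```shell
# Illustrative sketch only: pin the server to NUMA node 0 so CPU-side
# expert weights stay local to the cores that read them.
export LVLLM_MOE_NUMA_ENABLED=1   # env var named in this summary
export LK_THREADS=32              # thread count is an assumed placeholder
numactl --cpunodebind=0 --membind=0 \
    python -m vllm.entrypoints.openai.api_server --model <model-path>
```

Binding both CPU placement (--cpunodebind) and memory allocation (--membind) to the same node is what avoids the cross-node traffic the summary describes.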

Quick Start & Requirements

  • Primary Install: pip install -e . after cloning the repository and setting up the Python environment.
  • Prerequisites: CUDA 12.9, Python 3.12.11, libnuma-dev (Ubuntu) or numactl-devel (Rocky Linux). Requires x86 CPUs with AVX2 or newer instruction sets and NVIDIA GPUs. PyTorch 2.9.1 is specified.
  • Setup: Involves CUDA installation, environment setup (conda), dependency installation (requirements/build.txt), and compiling LvLLM. Specific environment variables (e.g., LVLLM_MOE_NUMA_ENABLED, LK_THREADS) are critical for configuration.
  • Links: Official quick-start examples are provided for various models (Qwen, MiniMax, Kimi, GLM).
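On Ubuntu, the setup flow described above might look roughly like the following. The repository URL is left as a placeholder rather than guessed; everything beyond the steps named in this summary (clone, conda environment, requirements/build.txt, editable install) is an assumption, so consult the project README for the authoritative steps.

```shell
# Sketch of the quick-start flow on Ubuntu; <lvllm-repo-url> is a placeholder.
git clone <lvllm-repo-url> lvllm && cd lvllm
sudo apt-get install -y libnuma-dev          # Rocky Linux: numactl-devel
conda create -n lvllm python=3.12 -y && conda activate lvllm
pip install -r requirements/build.txt        # build dependencies
pip install -e .                             # compile and install LvLLM
```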

Highlighted Details

  • Supports hybrid inference for MoE models using CPU-GPU parallelism.
  • Accommodates models larger than available VRAM by utilizing system memory.
  • Optimized GPU prefill runs concurrently with decoding for high GPU utilization.
  • Supports FP8 and AWQ 4-bit quantized model formats.
  • NUMA thread optimization minimizes cross-node communication and improves cache performance.

Maintenance & Community

No specific details regarding maintainers, community channels (like Discord/Slack), or roadmaps were found in the provided text.

Licensing & Compatibility

The repository's license is not explicitly stated in the provided README content. Verify the license before any commercial use or closed-source integration.

Limitations & Caveats

Requires specific hardware (AVX2+ CPUs, Nvidia GPUs) and CUDA 12.9. GGUF model support has been removed. Some models, like DeepSeek-V3.2, are listed as pending testing. For FP8/AWQ MoE models, max_num_batched_tokens is temporarily limited to 32000.

Health Check
Last Commit

2 days ago

Responsiveness

Inactive

Pull Requests (30d)
3
Issues (30d)
4
Star History
92 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Ying Sheng (coauthor of SGLang).

fastllm by ztxz16 — high-performance C++ LLM inference library
Top 0.2% · 4k stars · Created 2 years ago · Updated 1 day ago