LvLLM (guqiong96): Efficient hybrid CPU-GPU inference for large language models
LvLLM is a specialized extension of vLLM designed for efficient large-model inference, particularly for Mixture-of-Experts (MoE) models. It targets researchers and power users by enabling full utilization of CPU and GPU resources through a novel GPU + NUMA parallel architecture, significantly reducing GPU memory demands and accommodating larger models.
How It Works
LvLLM employs a "GPU + NUMA Dual Parallelism" strategy, supporting hybrid CPU-GPU decoding and prefill modes. Its "VRAM + Memory Load Balancing" allows the total model footprint to span both GPU VRAM and system RAM, effectively enabling the loading of models that would otherwise exceed GPU capacity. Additionally, "GPU Prefill Optimization" runs GPU prefill tasks in parallel with CPU-GPU hybrid decoding, aiming for near 100% GPU utilization. NUMA thread optimization further reduces cross-node communication and improves L3 cache hit rates.
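The "VRAM + Memory Load Balancing" idea above can be illustrated with a minimal placement sketch: fill GPU VRAM with as many layers as fit, and spill the remainder to system RAM for CPU-side execution. This is a hypothetical illustration, not LvLLM's actual implementation; the function name and sizes are invented.

```python
# Hypothetical sketch of VRAM + system-RAM load balancing: greedily place
# model layers in GPU VRAM until the budget is exhausted, spilling the
# rest to system RAM. Not LvLLM's real placement logic.

def place_layers(layer_sizes_gb, vram_budget_gb):
    """Return (gpu_layers, cpu_layers) index lists under a VRAM budget."""
    gpu_layers, cpu_layers = [], []
    used = 0.0
    for i, size in enumerate(layer_sizes_gb):
        if used + size <= vram_budget_gb:
            gpu_layers.append(i)   # resident in GPU VRAM
            used += size
        else:
            cpu_layers.append(i)   # spilled to system RAM
    return gpu_layers, cpu_layers

# Example: a 10-layer model at 4 GB per layer on a 24 GB GPU.
gpu, cpu = place_layers([4.0] * 10, vram_budget_gb=24.0)
print(gpu)  # first 6 layers fit in VRAM: [0, 1, 2, 3, 4, 5]
print(cpu)  # remaining 4 layers stay in system RAM: [6, 7, 8, 9]
```

This is why a model larger than a single GPU's VRAM can still load: only the spilled layers pay the slower system-memory path.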
Quick Start & Requirements
- Install: clone the repository, set up the Python environment, install the build dependencies (requirements/build.txt), then compile LvLLM with `pip install -e .`.
- System packages: libnuma-dev (Ubuntu) or numactl-devel (Rocky Linux).
- Hardware: x86 CPUs with AVX2 or newer instruction sets and Nvidia GPUs; PyTorch 2.9.1 is specified.
- Environment variables (e.g., LVLLM_MOE_NUMA_ENABLED, LK_THREADS) are critical for configuration.
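The steps above might look like the following. This is a hedged sketch: the repository URL, dependency file usage, and the environment-variable values are assumptions; the LvLLM README is authoritative.

```shell
# Hypothetical quick-start; URL and variable values are assumptions.
git clone https://github.com/guqiong96/LvLLM.git
cd LvLLM
pip install -r requirements/build.txt   # build dependencies
pip install -e .                        # compiles LvLLM in place

# Environment variables the README calls critical (values illustrative):
export LVLLM_MOE_NUMA_ENABLED=1   # assumed toggle for NUMA-aware MoE
export LK_THREADS=32              # assumed thread count; tune per NUMA node
```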
Maintenance & Community
No specific details regarding maintainers, community channels (like Discord/Slack), or roadmaps were found in the provided text.
Licensing & Compatibility
The repository's license is not explicitly stated in the provided README content. This omission requires further investigation for commercial use or closed-source integration.
Limitations & Caveats
Requires specific hardware (AVX2+ x86 CPUs, Nvidia GPUs) and CUDA 12.9. GGUF model support has been removed. Some models, such as DeepSeek-V3.2, are listed as pending testing. For FP8/AWQ MoE models, max_num_batched_tokens is temporarily capped at 32000.
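Since LvLLM extends vLLM, the batched-token cap would plausibly be set through vLLM-style engine arguments. A hedged configuration sketch, assuming the `LLM` constructor is unchanged; the model name is a placeholder:

```python
from vllm import LLM  # LvLLM extends vLLM; constructor assumed unchanged

# Stay at or below the temporary 32000-token cap for FP8/AWQ MoE models.
llm = LLM(
    model="some-fp8-moe-model",    # placeholder, not a real checkpoint
    max_num_batched_tokens=32000,  # temporary upper limit per the notes above
)
```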