local-inference-lab: Running large LLMs on PCIe GPUs without NVLink
Top 70.8% on SourcePulse
Summary
This repository serves as a community-sourced knowledge base for deploying large language models (LLMs) like Qwen3.5-397B, Kimi-K2.5, and GLM-5 on NVIDIA RTX 6000 Pro (Blackwell SM120) GPUs. It addresses the challenge of running massive models across multiple PCIe-connected GPUs without NVLink, targeting users with high-end workstation or server hardware. The project offers practical insights, performance benchmarks, and configuration details derived from extensive community experimentation, enabling efficient LLM inference on non-NVLink setups.
How It Works
The core approach optimizes LLM inference across 2, 4, or 8 RTX 6000 Pro GPUs connected via PCIe 5.0, with no NVLink. It documents specific hardware topologies, including PCIe switches (Broadcom, c-payne) and motherboards (ASUS ESC8000A-E13P, ASRock WRX90), used to manage inter-GPU communication. The project configures popular inference engines such as vLLM and SGLang, employing techniques like MTP (Multi-Token Prediction), DCP (Decode Context Parallelism), and NVFP4 quantization to maximize throughput and handle large models and long contexts efficiently.
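As a rough illustration of why GPU count and NVFP4 quantization matter here, the sketch below estimates the per-GPU weight footprint when a model is sharded evenly across GPUs. It is a back-of-the-envelope estimate, not something taken from the repository: the 397B parameter count is read off the model name, the 96 GiB per-GPU figure is an assumption about the RTX 6000 Pro, and 4.5 bits per weight assumes 4-bit values plus one 8-bit scale per 16-element block. Real checkpoints usually keep some layers in higher precision, and KV cache and activations need additional memory.

```python
# Back-of-the-envelope weight footprint for an evenly sharded model.
# All constants are illustrative assumptions, not measured values.

GPU_MEMORY_GIB = 96.0          # assumed memory per RTX 6000 Pro
NVFP4_BITS_PER_WEIGHT = 4.5    # 4-bit value + 8-bit scale per 16 weights


def weight_gib(params_billion: float,
               bits_per_weight: float = NVFP4_BITS_PER_WEIGHT) -> float:
    """Approximate total weight size in GiB for a given parameter count."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def per_gpu_weight_gib(params_billion: float, num_gpus: int,
                       bits_per_weight: float = NVFP4_BITS_PER_WEIGHT) -> float:
    """Weight size per GPU, assuming a perfectly even split across GPUs."""
    return weight_gib(params_billion, bits_per_weight) / num_gpus


if __name__ == "__main__":
    for num_gpus in (2, 4, 8):
        share = per_gpu_weight_gib(397, num_gpus)
        headroom = GPU_MEMORY_GIB - share  # left for KV cache, activations
        print(f"{num_gpus}x GPUs: {share:6.1f} GiB weights per GPU, "
              f"{headroom:6.1f} GiB headroom")
```

Under these assumptions, a negative headroom means the weights alone do not fit at that GPU count, before any context is allocated.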
Quick Start & Requirements
Highlighted Details
Maintenance & Community
This wiki was synthesized in March 2026 from approximately 5,000 messages on a community Discord server, together with community experimentation. Contributions via issues or pull requests are encouraged.
Licensing & Compatibility
The README does not specify a software license, so suitability for commercial use or closed-source linking is undetermined.
Limitations & Caveats
For GLM-5 on SM120, SGLang is the only viable inference engine, because vLLM lacks SM120-compatible MLA and sparse attention backends. Running GLM-5 with an FP8 KV cache produces garbled output, so a BF16 KV cache is required. The project focuses exclusively on RTX 6000 Pro (Blackwell SM120) GPUs and PCIe-based interconnects, and does not cover NVLink configurations.
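The GLM-5 caveats above can be encoded as a small pre-launch check. This is a hypothetical sketch: `LaunchPlan` and `check_glm5_sm120` are illustrative names, not part of vLLM, SGLang, or the repository, and the string values for engine and KV cache dtype are placeholders rather than real command-line flags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaunchPlan:
    """Hypothetical description of an intended deployment on SM120."""
    model: str           # e.g. "GLM-5"
    engine: str          # "vllm" or "sglang" (placeholder labels)
    kv_cache_dtype: str  # "bf16" or "fp8" (placeholder labels)


def check_glm5_sm120(plan: LaunchPlan) -> list[str]:
    """Return the caveats a plan violates; an empty list means none apply."""
    problems: list[str] = []
    if not plan.model.lower().startswith("glm-5"):
        return problems  # the caveats only concern GLM-5
    if plan.engine.lower() != "sglang":
        problems.append("GLM-5 on SM120 needs SGLang: vLLM lacks "
                        "SM120-compatible MLA and sparse attention backends")
    if plan.kv_cache_dtype.lower() != "bf16":
        problems.append("GLM-5 needs a BF16 KV cache: FP8 gives garbled output")
    return problems


if __name__ == "__main__":
    plan = LaunchPlan(model="GLM-5", engine="vllm", kv_cache_dtype="fp8")
    for problem in check_glm5_sm120(plan):
        print("warning:", problem)
```

Keeping the rules in one function makes them easy to update if a later engine release lifts either restriction.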