rtx6kpro by local-inference-lab

Running large LLMs on PCIe GPUs without NVLink

Created 4 months ago

665 stars

Top 49.8% on SourcePulse

Summary

This repository is a community-sourced knowledge base for deploying large language models (LLMs) such as Qwen3.5-397B, Kimi-K2.5, and GLM-5 on NVIDIA RTX 6000 Pro (Blackwell SM120) GPUs. It addresses the problem of running models too large for one card across multiple PCIe-connected GPUs without NVLink, and targets users with high-end workstation or server hardware. It collects practical notes, performance benchmarks, and configuration details drawn from community experimentation on non-NVLink setups.

How It Works

The core approach is to optimize LLM inference across multiple RTX 6000 Pro GPUs (2x, 4x, and 8x configurations) connected via PCIe 5.0, without NVLink. The wiki documents specific hardware topologies, including PCIe switches (Broadcom, c-payne) and motherboards (ASUS ESC8000A-E13P, ASRock WRX90), and how they affect inter-GPU communication. It covers configuration of the vLLM and SGLang inference engines, using MTP (multi-token prediction, a form of speculative decoding), DCP (decode context parallelism), and NVFP4 quantization to raise throughput and to fit large models and long contexts.
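To make the role of quantization and GPU count concrete, here is a rough sizing sketch. It rests on several assumptions that are not stated in the summary: 96 GB of VRAM per RTX 6000 Pro card, a total of 397 billion parameters inferred from the model name Qwen3.5-397B, roughly 4.5 bits per weight for NVFP4 (4-bit values plus one 8-bit scale per 16-element block), and 80% of VRAM being usable for weights. The helper names are hypothetical, and the result only bounds weight storage; it ignores KV cache and activation memory.

```python
# Rough weight-memory sizing. All constants below are assumptions for
# illustration, not figures taken from the wiki.
GPU_VRAM_GB = 96  # assumed VRAM per RTX 6000 Pro card

# Approximate bits stored per weight. NVFP4 is assumed to carry one 8-bit
# scale per 16-element block: 4 + 8/16 = 4.5 bits.
BITS_PER_WEIGHT = {"bf16": 16.0, "fp8": 8.0, "nvfp4": 4.5}


def weight_gb(params_billion, fmt):
    """Weight footprint in GB (1 GB = 1e9 bytes) for a given storage format."""
    return params_billion * BITS_PER_WEIGHT[fmt] / 8


def min_gpus(params_billion, fmt, usable_fraction=0.8):
    """Smallest of the 2x/4x/8x configurations whose pooled VRAM holds the weights.

    usable_fraction is an assumed headroom factor for KV cache and activations.
    Returns None if even 8 GPUs are not enough under these assumptions.
    """
    need = weight_gb(params_billion, fmt)
    for n in (2, 4, 8):
        if n * GPU_VRAM_GB * usable_fraction >= need:
            return n
    return None


if __name__ == "__main__":
    for fmt in ("bf16", "fp8", "nvfp4"):
        print(fmt, round(weight_gb(397, fmt), 1), "GB ->", min_gpus(397, fmt), "GPUs")
```

Under these assumptions a 397B-parameter model needs about 223 GB of weights in NVFP4 versus 397 GB in FP8, which is the difference between fitting on four cards and needing eight.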

Quick Start & Requirements

Primary Install/Run: No single install command is provided. Setup relies on community builds and Docker images, and on configuring a specific hardware and software environment.
Prerequisites: NVIDIA RTX 6000 Pro (Blackwell SM120) GPUs, PCIe 5.0 x16 slots, a suitable motherboard (e.g., ASUS ESC8000A-E13P, ASRock WRX90), optionally PCIe switches, and a compatible CPU (AMD EPYC Turin/Genoa). vLLM or SGLang must be configured specifically for this hardware.
Links:
- Models: https://github.com/voipmonitor/rtx6kpro/wiki/Models
- Hardware & Topology: https://github.com/voipmonitor/rtx6kpro/wiki/PCIe-Topology
- Inference Engines: https://github.com/voipmonitor/rtx6kpro/wiki/Inference-Engines

Highlighted Details

- MTP=2 is identified as the optimal setting for throughput (+51-72%), with MTP>3 noted as unstable.
- A specific NCCL graph XML fix is crucial for Turin platforms, improving performance by 5-11% over default configurations.
- PCIe switches significantly reduce single-batch latency, improving token generation rates (e.g., 101 tok/s for Kimi K2.5 with switches vs. 60 tok/s without).
- BF16 KV cache is mandatory for GLM-5 on SM120 GPUs; FP8 results in garbled output.
- NVFP4 quantization offers a 2x decode speedup over FP8 for supported models on SM120.
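The figures above can be put on a common footing with a little arithmetic. The sketch below converts the quoted Kimi K2.5 rates into per-token latency and relative gain, and adds a simplified speculative-decoding model for MTP. That model assumes each drafted token is accepted independently with a fixed probability, which is an idealization and not something the wiki states; the function names are hypothetical.

```python
def ms_per_token(tokens_per_second):
    """Average decode latency per token, in milliseconds."""
    return 1000.0 / tokens_per_second


def relative_gain(new, old):
    """Fractional improvement of new over old (0.5 means +50%)."""
    return (new - old) / old


def expected_tokens_per_step(k, accept):
    """Expected tokens emitted per verification step with k drafted tokens.

    Assumes each drafted token is accepted independently with probability
    `accept` and that generation stops at the first rejection; the verifier
    always contributes one token. This is a geometric-series sum:
    1 + a + a^2 + ... + a^k.
    """
    if accept >= 1.0:
        return float(k + 1)
    return (1 - accept ** (k + 1)) / (1 - accept)


if __name__ == "__main__":
    # Kimi K2.5 rates quoted above: 101 tok/s with switches, 60 tok/s without.
    print(f"{ms_per_token(101):.1f} ms vs {ms_per_token(60):.1f} ms per token")
    print(f"gain from switches: {relative_gain(101, 60):+.0%}")
    # Illustrative acceptance rates only; real values depend on model and prompt.
    for a in (0.5, 0.6, 0.7):
        print(f"MTP=2, accept={a}: {expected_tokens_per_step(2, a):.2f} tokens/step")
```

The tokens-per-step value is an upper bound on the throughput multiplier, since it ignores the cost of drafting and verifying the extra tokens.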

Maintenance & Community

The wiki was generated in March 2026 by synthesizing approximately 5,000 messages from a community Discord server, together with community experimentation. Contributions via issues or pull requests are encouraged.

Licensing & Compatibility

The provided README text does not specify a software license. Compatibility for commercial use or closed-source linking is undetermined without a license.

Limitations & Caveats

For GLM-5 on SM120, SGLang is the only viable inference engine, because vLLM lacks SM120-compatible MLA and sparse attention backends. An FP8 KV cache produces garbled output with GLM-5, so a BF16 KV cache is required. The project covers only RTX 6000 Pro (Blackwell SM120) GPUs on PCIe-based interconnects and excludes NVLink configurations.
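These constraints can be encoded as a small pre-launch check. The sketch below is illustrative only: `ServingChoice` and `pick_serving_choice` are hypothetical names, the string values are labels rather than flags of vLLM or SGLang, and the only rule encoded is the GLM-5 one stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServingChoice:
    engine: str          # label only, e.g. "sglang"
    kv_cache_dtype: str  # label only, e.g. "bf16"
    note: str


def pick_serving_choice(model, gpu_arch):
    """Return the engine and KV cache constraints this summary states.

    Hypothetical helper: it covers SM120 only and knows a single rule
    (GLM-5 needs SGLang with a BF16 KV cache). Other models come back
    as unconstrained because the summary gives no rule for them.
    """
    if gpu_arch.strip().upper() != "SM120":
        raise ValueError("this summary only covers SM120 (RTX 6000 Pro) GPUs")
    if model.strip().lower().startswith("glm-5"):
        return ServingChoice(
            engine="sglang",
            kv_cache_dtype="bf16",
            note="vLLM lacks SM120 MLA/sparse attention backends; FP8 KV cache garbles output",
        )
    return ServingChoice("unspecified", "unspecified", "no constraint stated in the summary")


if __name__ == "__main__":
    print(pick_serving_choice("GLM-5", "sm120"))
    print(pick_serving_choice("Kimi-K2.5", "SM120"))
```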
