rtx6kpro  by local-inference-lab

Running large LLMs on PCIe GPUs without NVLink

Created 3 months ago
410 stars

Top 70.8% on SourcePulse

Project Summary

This repository serves as a community-sourced knowledge base for deploying large language models (LLMs) like Qwen3.5-397B, Kimi-K2.5, and GLM-5 on NVIDIA RTX 6000 Pro (Blackwell SM120) GPUs. It addresses the challenge of running massive models across multiple PCIe-connected GPUs without NVLink, targeting users with high-end workstation or server hardware. The project offers practical insights, performance benchmarks, and configuration details derived from extensive community experimentation, enabling efficient LLM inference on non-NVLink setups.

How It Works

The core approach is to optimize LLM inference across multiple RTX 6000 Pro GPUs (2x, 4x, and 8x configurations) connected via PCIe 5.0, without relying on NVLink. It documents specific hardware topologies, including PCIe switches (Broadcom, c-payne) and motherboards (ASUS ESC8000A-E13P, ASRock WRX90), used to manage inter-GPU communication. It also covers configuring popular inference engines such as vLLM and SGLang, using techniques like MTP (multi-token prediction, a form of speculative decoding), DCP (decode context parallelism), and NVFP4 quantization to maximize throughput and handle large models and long contexts efficiently.
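To make the quantization and GPU-count trade-off concrete, here is a rough sizing sketch. It is not taken from the repository. It assumes 96 GB of memory per RTX 6000 Pro, reads the parameter count (397B) off the Qwen3.5-397B model name, and treats NVFP4 as about 4.5 bits per weight (4-bit values plus one 8-bit scale per 16-element block). The `reserve_fraction` of 20% for KV cache and activations is an arbitrary placeholder, and the function names are hypothetical. Real checkpoints usually keep some layers in higher precision, so actual footprints will differ.

```python
# Rough weight-memory sizing under the assumptions stated above.
# Bits stored per weight; the NVFP4 value assumes 4 + 8/16 = 4.5 bits
# (per-tensor scales and unquantized layers are ignored).
BITS_PER_WEIGHT = {"bf16": 16.0, "fp8": 8.0, "nvfp4": 4.5}


def weight_gb(params_billion: float, fmt: str) -> float:
    """Approximate weight footprint in GB (10^9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[fmt] / 8.0


def fits(params_billion: float, fmt: str, num_gpus: int,
         gpu_gb: float = 96.0, reserve_fraction: float = 0.2) -> bool:
    """True if the weights fit once part of each GPU is reserved
    for KV cache and activations (reserve_fraction is a guess)."""
    usable_gb = num_gpus * gpu_gb * (1.0 - reserve_fraction)
    return weight_gb(params_billion, fmt) <= usable_gb


if __name__ == "__main__":
    for fmt in BITS_PER_WEIGHT:
        row = {n: fits(397, fmt, n) for n in (2, 4, 8)}
        print(f"{fmt:6s} {weight_gb(397, fmt):7.1f} GB  fits: {row}")
```

Under these assumptions, lower-precision formats reduce the number of GPUs needed just to hold the weights, which is one reason quantization matters so much on PCIe-only systems.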

Highlighted Details

  • MTP=2 is identified as the optimal setting for throughput (+51-72%), with MTP>3 noted as unstable.
  • A specific NCCL graph XML fix is crucial for Turin platforms, improving performance by 5-11% over default configurations.
  • PCIe switches significantly reduce single-batch latency, improving token generation rates (e.g., 101 tok/s for Kimi K2.5 with switches vs. 60 tok/s without).
  • BF16 KV cache is mandatory for GLM-5 on SM120 GPUs; FP8 results in garbled output.
  • NVFP4 quantization offers a 2x decode speedup over FP8 for supported models on SM120.
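The figures above can be turned into speedup ratios and per-token latencies with simple arithmetic. The sketch below uses only the numbers quoted in the list (101 vs. 60 tok/s, and the +51-72% MTP=2 range). The 100 tok/s baseline used for the MTP projection is a hypothetical value chosen for illustration, not a measurement from the wiki.

```python
def ms_per_token(tok_per_s: float) -> float:
    """Convert a decode rate in tokens/second to milliseconds per token."""
    return 1000.0 / tok_per_s


def speedup(new_tok_s: float, old_tok_s: float) -> float:
    """Ratio of two decode rates."""
    return new_tok_s / old_tok_s


def apply_gain(baseline_tok_s: float, gain_percent: float) -> float:
    """Throughput after a percentage gain over a baseline."""
    return baseline_tok_s * (1.0 + gain_percent / 100.0)


if __name__ == "__main__":
    with_switch, without_switch = 101.0, 60.0  # Kimi K2.5 figures quoted above
    print(f"switch speedup: {speedup(with_switch, without_switch):.2f}x")
    print(f"per-token latency: {ms_per_token(without_switch):.1f} ms -> "
          f"{ms_per_token(with_switch):.1f} ms")

    baseline = 100.0  # hypothetical tok/s without MTP, for illustration only
    low, high = apply_gain(baseline, 51), apply_gain(baseline, 72)
    print(f"MTP=2 projected range: {low:.0f}-{high:.0f} tok/s")
```

The quoted switch result corresponds to roughly a 1.68x speedup, or about 9.9 ms per token instead of about 16.7 ms.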

Maintenance & Community

This wiki was synthesized in March 2026 from approximately 5,000 messages on a community Discord server, together with community experimentation. Contributions via issues or pull requests are encouraged.

Licensing & Compatibility

The README does not specify a software license, so compatibility with commercial use or closed-source linking is undetermined.

Limitations & Caveats

For GLM-5 on SM120, SGLang is the only viable inference engine, because vLLM lacks SM120-compatible MLA and sparse attention backends. Running GLM-5 with an FP8 KV cache produces garbled output, so a BF16 KV cache is required. The project covers only RTX 6000 Pro (Blackwell SM120) GPUs with PCIe-based interconnects; NVLink configurations are out of scope.
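These caveats can be encoded as a small pre-flight check. The sketch below is illustrative only: `DeployConfig` and `check_config` are hypothetical names, not part of vLLM, SGLang, or the repository. The rules mirror the constraints reported in this summary (the GLM-5 engine and KV cache requirements, plus the instability noted for MTP>3).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DeployConfig:
    """Hypothetical description of a planned deployment on SM120."""
    model: str            # e.g. "GLM-5"
    engine: str           # "sglang" or "vllm"
    kv_cache_dtype: str   # "bf16" or "fp8"
    mtp: int = 0          # MTP setting; 0 means disabled


def check_config(cfg: DeployConfig) -> List[str]:
    """Return the reported caveats that the config runs into."""
    problems: List[str] = []
    if cfg.model.lower() == "glm-5":
        if cfg.engine.lower() != "sglang":
            problems.append("GLM-5 on SM120: SGLang is the only viable engine")
        if cfg.kv_cache_dtype.lower() != "bf16":
            problems.append("GLM-5 on SM120: use a BF16 KV cache (FP8 garbles output)")
    if cfg.mtp > 3:
        problems.append("MTP>3 is reported as unstable")
    return problems


if __name__ == "__main__":
    bad = DeployConfig(model="GLM-5", engine="vllm", kv_cache_dtype="fp8", mtp=4)
    for problem in check_config(bad):
        print("WARNING:", problem)
```

A check like this only reflects the findings as summarized here. It should be revisited as engine support for SM120 changes.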

Health Check

  • Last commit: 14 hours ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 1
  • Star history: 92 stars in the last 30 days
