local-llm  by jamesob

Guide to building high-performance local LLM inference systems

Created 2 days ago

773 stars

Top 44.4% on SourcePulse

Project Summary

This repository is a guide to deploying state-of-the-art large language models (LLMs) and speech-to-text (STT) models locally. It targets users with hardware budgets of roughly $2k to $40k who want to run powerful models on-premises. It offers detailed hardware recommendations, lesser-known configuration details, and Docker-based serving setups aimed at high performance and low latency without relying on cloud providers.

How It Works

The core strategy is to maximize total VRAM and inter-GPU communication speed. The high-end setup uses four NVIDIA RTX Pro 6000 GPUs (384GB of VRAM in total) connected through a c-payne PCIe Gen4 switch. The switch lets the GPUs exchange data directly, peer-to-peer (P2P), instead of routing traffic through the CPU root complex, which speeds up tensor parallelism. Docker-compose configurations are provided for serving various models, along with a harness for local STT using whisper-large-v3, all aimed at efficient, low-latency inference.
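
To make the VRAM-first reasoning concrete, here is a rough sizing sketch. It assumes the 594B parameter count implied by the GLM-5.2-594B model name under Highlighted Details; the bit widths and the 15% reserve for KV cache and activations are illustrative assumptions, not figures from the repository.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    params_billion * 1e9 params * (bits / 8) bytes / 1e9 simplifies to this.
    """
    return params_billion * bits_per_weight / 8


def fits_in_vram(
    params_billion: float,
    bits_per_weight: float,
    total_vram_gb: float,
    reserve_fraction: float = 0.15,  # assumed headroom for KV cache/activations
) -> bool:
    """True if the weights fit after holding back the reserved fraction."""
    usable_gb = total_vram_gb * (1.0 - reserve_fraction)
    return weights_gb(params_billion, bits_per_weight) <= usable_gb


if __name__ == "__main__":
    # 384 GB is the total VRAM of the high-end setup described above.
    for bits in (16, 8, 4):
        need = weights_gb(594, bits)
        verdict = "fits" if fits_in_vram(594, bits, 384) else "does not fit"
        print(f"{bits:>2}-bit weights: ~{need:.0f} GB, {verdict} in 384 GB")
```

Under these assumptions only the 4-bit case fits, which illustrates why total VRAM, rather than the base system, dominates the budget. The quantization the repository actually uses is not stated here.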

Quick Start & Requirements

  • Primary install/run command: Docker-compose configurations are available in ./runners/ for specific models.
  • Non-default prerequisites: high-end NVIDIA GPUs (e.g., 4x RTX Pro 6000 or 2x RTX 3090) with 48GB-384GB of total VRAM; server-class hardware (EPYC, DDR4 ECC); a c-payne PCIe Gen4 switch; custom GPU mount fabrication; Linux booted with specific kernel/GRUB parameters (iommu=off, amd_iommu=off, nomodeset); and systemd services that disable ACS and apply GPU power limits.
  • Estimated setup time: Significant BIOS/kernel tuning required; custom fabrication may take a day.
  • Links: rtx6kpro repo: https://github.com/local-inference-lab/rtx6kpro; c-payne switches: https://c-payne.com; Discord: https://discord.gg/QMNvFkuDN.
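
One way to sanity-check the kernel parameters listed above is to compare them against the running kernel's command line. The sketch below is not part of the repository; the function names are hypothetical, and it only assumes that Linux exposes the boot command line at /proc/cmdline.

```python
from pathlib import Path

# Parameters named in the prerequisites above.
REQUIRED_PARAMS = ("iommu=off", "amd_iommu=off", "nomodeset")


def missing_kernel_params(cmdline: str, required=REQUIRED_PARAMS) -> list[str]:
    """Return the required parameters absent from a kernel command line."""
    present = set(cmdline.split())
    return [param for param in required if param not in present]


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the running kernel's command line (Linux only)."""
    return Path(path).read_text().strip()


if __name__ == "__main__":
    # Hypothetical command line, used so the example runs on any machine.
    sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro iommu=off nomodeset"
    print(missing_kernel_params(sample))  # -> ['amd_iommu=off']
```

On the target machine, `missing_kernel_params(read_cmdline())` should return an empty list once the GRUB configuration has been updated and the system rebooted.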

Highlighted Details

  • Achieves Gen4 line rate P2P bandwidth (27.5 GB/s unidirectional / 50.4 GB/s bidirectional) with sub-microsecond latency between GPUs via a PCIe switch.
  • Provides ready-to-run Docker configurations for models like GLM-5.2-594B (~80 t/s @ 460k context) and STT with whisper-large-v3.
  • Includes detailed hardware BOMs for a $2k build (Qwen, STT) and a $40k build (near-Opus), prioritizing VRAM over base system cost.
  • Includes scripts and guides for P2P GPU communication optimization and performance tuning.
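
For context on the P2P bandwidth figures above, the following sketch compares them with the theoretical PCIe Gen4 x16 rate. It assumes an x16 link per GPU (the link width is not stated here) and uses the standard Gen4 parameters of 16 GT/s per lane with 128b/130b encoding; packet and protocol overhead is ignored, so real transfers always land below the computed ceiling.

```python
def pcie_gbps(gt_per_s: float = 16.0, lanes: int = 16,
              encoding: float = 128 / 130) -> float:
    """Theoretical one-direction PCIe bandwidth in GB/s, before packet overhead."""
    return gt_per_s * lanes * encoding / 8


def efficiency(measured_gbps: float, theoretical_gbps: float) -> float:
    """Fraction of the theoretical rate that a measurement achieves."""
    return measured_gbps / theoretical_gbps


if __name__ == "__main__":
    one_way = pcie_gbps()     # about 31.5 GB/s for Gen4 x16
    both_ways = 2 * one_way   # full duplex: both directions at once
    print(f"unidirectional: {efficiency(27.5, one_way):.1%} of {one_way:.1f} GB/s")
    print(f"bidirectional:  {efficiency(50.4, both_ways):.1%} of {both_ways:.1f} GB/s")
```

With these assumptions, the reported 27.5 GB/s is roughly 87% of the raw link rate, which is consistent with a link running near its practical limit once packet overhead is accounted for.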

Maintenance & Community

  • Community support is available via Discord: https://discord.gg/QMNvFkuDN.
  • The rtx6kpro repository serves as a frequently updated resource.

Licensing & Compatibility

  • The README does not explicitly state a license for the repository's content or scripts.
  • Compatibility for commercial use is not specified.

Limitations & Caveats

  • The setup demands substantial hardware investment and complex system configuration, including BIOS tuning, kernel parameters, and runtime scripts.
  • Custom hardware fabrication for GPU mounting may be necessary.
  • The guide reflects hardware costs and model availability as of July 2026.
  • The README describes running the high-end rig on a single 110V circuit as something done "probably unwisely"; doing so requires aggressive power limiting.
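
The power-limiting caveat can be illustrated with a rough budget calculation. Only the 110V figure and the four-GPU count come from the text above; the 15A breaker rating, the 80% continuous-load derating, and the 300W allowance for the rest of the system are assumptions chosen for illustration.

```python
def per_gpu_power_cap_w(
    volts: float,
    breaker_amps: float,
    num_gpus: int,
    system_overhead_w: float,
    continuous_derate: float = 0.8,  # assumed rule of thumb for sustained load
) -> float:
    """Split the circuit's sustained power budget evenly across the GPUs."""
    budget_w = volts * breaker_amps * continuous_derate
    gpu_budget_w = budget_w - system_overhead_w
    if gpu_budget_w <= 0:
        raise ValueError("circuit budget is consumed by non-GPU load")
    return gpu_budget_w / num_gpus


if __name__ == "__main__":
    # Assumed: 15 A breaker, 300 W for CPU, fans, drives, and PSU losses.
    cap = per_gpu_power_cap_w(volts=110, breaker_amps=15,
                              num_gpus=4, system_overhead_w=300)
    print(f"per-GPU cap: about {cap:.0f} W")
```

A cap of this kind is commonly applied with the power-limit option of nvidia-smi (`nvidia-smi -pl <watts>`); how the repository's systemd service sets its limits, and which values it uses, is not specified here.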

Health Check

  • Last commit: 1 day ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 774 stars in the last 2 days
