local-llm by jamesob

Guide to building high-performance local LLM inference systems

Created 2 days ago

773 stars

Top 44.4% on SourcePulse

3 Experts Love This Project

hammer

Jeff Hammerbacher

Cofounder of Cloudera

cournape

David Cournapeau

Author of scikit-learn

joewalnes

Head of Experimental Projects at Stripe

Project Summary

This repository guides users through deploying state-of-the-art Large Language Models (LLMs) and Speech-to-Text (STT) models locally. It targets users with substantial hardware budgets ($2k-$40k) who want to run powerful AI models on-premises. The project offers detailed hardware recommendations, non-obvious configuration details, and Docker-based serving setups aimed at high performance and low latency without relying on cloud providers.

How It Works

The core strategy is to maximize total VRAM and inter-GPU communication speed. The high-end setup uses four NVIDIA RTX Pro 6000 GPUs (384GB VRAM in total) connected via a c-payne PCIe Gen4 switch. The switch enables direct peer-to-peer (P2P) GPU communication that bypasses the CPU root complex, which speeds up tensor parallelism. Docker-compose configurations are provided for serving various models, alongside a harness for local STT using whisper-large-v3, with the aim of efficient, low-latency inference.
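To make the emphasis on VRAM concrete, the following is a rough sizing sketch rather than anything taken from the repository. It assumes that the "594B" in the model name GLM-5.2-594B means roughly 594 billion parameters, and it uses a hypothetical 10% VRAM reserve for KV cache and activations; the function names and the bit widths tried are illustrative only.

```python
def weights_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: int, vram_gb: float,
         reserve_fraction: float = 0.10) -> bool:
    """True if the weights fit after reserving part of VRAM for KV cache etc.

    reserve_fraction is a hypothetical placeholder, not a measured value.
    """
    usable_gb = vram_gb * (1 - reserve_fraction)
    return weights_gb(params_billion, bits_per_weight) <= usable_gb


TOTAL_VRAM_GB = 384      # total VRAM of the high-end setup described above
PARAMS_BILLION = 594     # assumption: read off the model name

for bits in (16, 8, 4):
    size = weights_gb(PARAMS_BILLION, bits)
    print(f"{bits}-bit weights: {size:.0f} GB, fits: {fits(PARAMS_BILLION, bits, TOTAL_VRAM_GB)}")
```

Under these assumptions only the 4-bit case fits in 384GB, which illustrates why total VRAM is the first constraint to check; the quantization actually used by the repository's configurations is not stated in this summary.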

Quick Start & Requirements

Primary install/run command: Docker-compose configurations are available in ./runners/ for specific models.
Non-default prerequisites: High-end NVIDIA GPUs (e.g., 4x RTX Pro 6000 or 2x RTX 3090) with 48GB-384GB of total VRAM; specific server hardware (EPYC, DDR4 ECC); a c-payne PCIe Gen4 switch; custom GPU mount fabrication; Linux with specific kernel/GRUB parameters (iommu=off, amd_iommu=off, nomodeset); and systemd services that disable ACS (PCIe Access Control Services) and apply GPU power limits.
Estimated setup time: Significant BIOS/kernel tuning required; custom fabrication may take a day.
Links: rtx6kpro repo: https://github.com/local-inference-lab/rtx6kpro; c-payne switches: https://c-payne.com; Discord: https://discord.gg/QMNvFkuDN.
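As a small aid for the kernel-parameter prerequisite above, here is a minimal sketch of a check that the listed parameters are present on the kernel command line. It is not part of the repository; the function names and the sample command line are hypothetical, and only the three parameter strings come from the summary. On Linux the running kernel's command line can be read from /proc/cmdline.

```python
# Kernel parameters listed in the prerequisites above.
REQUIRED_PARAMS = ("iommu=off", "amd_iommu=off", "nomodeset")


def missing_kernel_params(cmdline: str, required=REQUIRED_PARAMS) -> list:
    """Return the required parameters that are absent from a kernel command line."""
    present = set(cmdline.split())  # exact token match, so "iommu=off" != "amd_iommu=off"
    return [param for param in required if param not in present]


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the running kernel's command line (Linux only)."""
    with open(path) as handle:
        return handle.read().strip()


# Hypothetical command line for illustration; on a real host, pass read_cmdline().
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda2 ro iommu=off nomodeset"
print(missing_kernel_params(sample))  # ['amd_iommu=off']
```

Matching whole tokens rather than substrings avoids treating amd_iommu=off as satisfying iommu=off.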

Highlighted Details

Achieves Gen4 line rate P2P bandwidth (27.5 GB/s unidirectional / 50.4 GB/s bidirectional) with sub-microsecond latency between GPUs via a PCIe switch.
Provides ready-to-run Docker configurations for models like GLM-5.2-594B (~80 tokens/s at 460k context) and for STT with whisper-large-v3.
Offers detailed hardware bills of materials (BOMs) for $2k (Qwen, STT) and $40k (near-Opus) setups, prioritizing VRAM over base system cost.
Includes scripts and guides for P2P GPU communication optimization and performance tuning.
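For context on the bandwidth figures above, the sketch below compares them with the raw data rate of a PCIe Gen4 link. The per-lane rate (16 GT/s) and 128b/130b encoding are properties of PCIe Gen4; the x16 link width is an assumption, since the summary does not state it, and the helper name is illustrative.

```python
GT_PER_S_PER_LANE = 16.0   # PCIe Gen4 per-lane transfer rate
ENCODING = 128 / 130       # 128b/130b line encoding
LANES = 16                 # assumption: each GPU sits on an x16 link


def raw_link_gb_s(lanes: int = LANES) -> float:
    """Raw one-direction link rate in GB/s, before packet and protocol overhead."""
    return GT_PER_S_PER_LANE * ENCODING * lanes / 8


# Figures quoted in the highlighted details.
MEASURED_UNI_GB_S = 27.5
MEASURED_BIDI_GB_S = 50.4

uni_raw = raw_link_gb_s()
bidi_raw = 2 * uni_raw

print(f"raw unidirectional rate: {uni_raw:.1f} GB/s")
print(f"unidirectional fraction of raw: {MEASURED_UNI_GB_S / uni_raw:.0%}")
print(f"bidirectional fraction of raw:  {MEASURED_BIDI_GB_S / bidi_raw:.0%}")
```

The raw rate ignores packet headers and flow-control traffic, so achievable payload bandwidth is always somewhat below it; the reported figures are best read against that practical ceiling rather than the raw number.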

Maintenance & Community

Community support is available via Discord: https://discord.gg/QMNvFkuDN.
The rtx6kpro repository serves as a frequently updated resource.

Licensing & Compatibility

The README does not explicitly state a license for the repository's content or scripts.
Compatibility for commercial use is not specified.

Limitations & Caveats

The setup demands substantial hardware investment and complex system configuration, including BIOS tuning, kernel parameters, and runtime scripts.
Custom hardware fabrication for GPU mounting may be necessary.
The guide reflects hardware costs and model availability as of July 2026.
The high-end rig runs on a single 110V circuit, which the README itself describes as done "probably unwisely"; this requires aggressive power limiting.
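To illustrate why a single 110V circuit forces aggressive power limiting, here is a back-of-the-envelope sketch. Every number other than the 110V figure and the four-GPU count is a hypothetical assumption (15A breaker, 80% continuous-load derating, 300W for the rest of the system), and none of them come from the repository.

```python
def per_gpu_limit_watts(volts: int, breaker_amps: int, derate_percent: int,
                        system_overhead_watts: int, gpu_count: int) -> int:
    """Whole-watt power cap per GPU that keeps the rig inside the circuit budget."""
    circuit_budget = volts * breaker_amps * derate_percent // 100
    gpu_budget = circuit_budget - system_overhead_watts
    if gpu_budget <= 0:
        raise ValueError("system overhead alone exceeds the circuit budget")
    return gpu_budget // gpu_count


limit = per_gpu_limit_watts(
    volts=110,                  # from the summary
    breaker_amps=15,            # assumption: common household breaker size
    derate_percent=80,          # assumption: continuous-load derating
    system_overhead_watts=300,  # assumption: CPU, RAM, fans, PSU losses
    gpu_count=4,                # the high-end setup's GPU count
)
print(f"per-GPU cap: {limit} W")  # per-GPU cap: 255 W
```

A cap computed this way would typically be applied per GPU with nvidia-smi's power-limit option (-pl); the actual limits used by the repository's systemd service are not given in this summary.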

Health Check

Last Commit: 1 day ago

Responsiveness: Inactive

Pull Requests (30d): 0

Issues (30d): 1

Star History

774 stars in the last 2 days

Explore Similar Projects

kaiwu by val1813

Auto-tuned local LLM serving for optimal performance

Created 2 months ago

Updated 2 months ago

rtx6kpro by local-inference-lab

Running large LLMs on PCIe GPUs without NVLink

Created 3 months ago

Updated 10 hours ago

shard by leyten

Pipeline-parallel LLM inference across distributed machines

Created 2 weeks ago

Updated 1 day ago

sarathi-serve by microsoft

LLM serving engine for low-latency & high-throughput inference (OSDI'24 paper)

Created 2 years ago

Updated 5 months ago

Starred by Omar Sanseviero (DevRel at Google DeepMind), Roy Frostig (Coauthor of JAX; Research Scientist at Google DeepMind), and 1 more.

JetStream by AI-Hypercomputer

LLM inference engine optimized for throughput and memory on XLA devices

Created 2 years ago

Updated 6 months ago

club-3090 by noonghunna

Local LLM serving recipes for RTX 3090 GPUs

Created 2 months ago

Updated 11 hours ago

Starred by Wing Lian (Founder of Axolotl AI), Zhiqiang Xie (Coauthor of SGLang), and 1 more.

TileRT by tile-ai

Ultra-low-latency LLM inference runtime

Created 7 months ago

Updated 3 weeks ago

can-i-finetune-this by DaoyuanLi2816

Estimate and optimize LLM fine-tuning on consumer GPUs

Created 1 month ago

Updated 3 days ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Johannes Hagemann (Cofounder of Prime Intellect), and 4 more.

S-LoRA by S-LoRA

System for scalable LoRA adapter serving

Created 2 years ago

Updated 2 years ago

amd-strix-halo-toolboxes by kyuz0

LLM inference toolboxes for AMD Ryzen AI Max

Created 11 months ago

Updated 1 week ago

Starred by Luis Capelo (Cofounder of Lightning AI) and Piero Molino (Cofounder of Predibase).

lucebox-hub by Luce-Org

Optimized LLM inference for specific hardware

Created 3 months ago

Updated 2 days ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).

fastllm by ztxz16

High-performance C++ LLM inference library

Created 3 years ago

Updated 1 week ago