club-3090 by noonghunna

Local LLM serving recipes for RTX 3090 GPUs

Created 2 months ago

1,657 stars

Top 24.6% on SourcePulse

Project Summary

This repository provides community-vetted configurations and benchmarks for serving modern LLMs locally on NVIDIA RTX 3090 GPUs. It targets users with one or two 3090s seeking to run LLMs at home or in a homelab, offering optimized setups for maximum throughput or maximum context/robustness, complete with a drop-in OpenAI-compatible API.

How It Works

The project employs a multi-engine, model-agnostic approach, supporting vLLM for high-throughput inference (up to 127 TPS) and llama.cpp for maximum context length (262K) and robustness. Configurations are provided via Docker Compose, enabling an OpenAI-compatible API endpoint. The architecture scales easily for new models, currently featuring Qwen3.6-27B.

Quick Start & Requirements

Installation involves cloning the repo, running scripts/setup.sh <model>, and then scripts/launch.sh for an interactive setup. Key requirements include 1-2x NVIDIA RTX 3090 (24 GB each), Linux (Ubuntu 22.04+ tested), Docker with NVIDIA Container Toolkit, and NVIDIA driver 580.x+. Detailed hardware notes are in docs/HARDWARE.md.

Highlighted Details

Achieve up to 127 TPS with vLLM on dual RTX 3090s, supporting vision, tools, and MTP.
Run full 262K context on a single RTX 3090 using llama.cpp, with stress-tested stability for agents.
Deploy an OpenAI-compatible API on localhost:8020 via validated Docker Compose.
Model-agnostic design allows easy integration of new LLMs.

Maintenance & Community

The project acknowledges contributions from key individuals and projects like vLLM and llama.cpp. It consolidates previous efforts into a single repository, encouraging new issues here. Community feedback from Reddit/X is noted, but direct community links are absent.

Licensing & Compatibility

Licensed under Apache 2.0, permitting broad usage, modification, and distribution, including for commercial purposes.

Limitations & Caveats

Focuses on RTX 3090 (24 GB); smaller GPUs are insufficient for 27B models. vLLM requires Linux/CUDA; llama.cpp recipes assume Linux paths. SGLang engine is currently blocked.

club-3090 by noonghunna

Explore Similar Projects

vllm-swift by TheTom

kaiwu by val1813

Kolosal by KolosalAI

dotLLM by kkokosa

aikit by kaito-project

llama.cpp-deepseek-v4-flash by antirez

ollama-intel-gpu by mattcurf

sarathi-serve by microsoft

JetStream by AI-Hypercomputer

ServerlessLLM by ServerlessLLM

lemonade by lemonade-sdk

ipex-llm by intel