club-3090  by noonghunna

Local LLM serving recipes for RTX 3090 GPUs

Created 4 weeks ago

New!

1,150 stars

Top 33.2% on SourcePulse

GitHubView on GitHub
Project Summary

This repository provides community-vetted configurations and benchmarks for serving modern LLMs locally on NVIDIA RTX 3090 GPUs. It targets users with one or two 3090s seeking to run LLMs at home or in a homelab, offering optimized setups for maximum throughput or maximum context/robustness, complete with a drop-in OpenAI-compatible API.

How It Works

The project employs a multi-engine, model-agnostic approach, supporting vLLM for high-throughput inference (up to 127 TPS) and llama.cpp for maximum context length (262K) and robustness. Configurations are provided via Docker Compose, enabling an OpenAI-compatible API endpoint. The architecture scales easily for new models, currently featuring Qwen3.6-27B.

Quick Start & Requirements

Installation involves cloning the repo, running scripts/setup.sh <model>, and then scripts/launch.sh for an interactive setup. Key requirements include 1-2x NVIDIA RTX 3090 (24 GB each), Linux (Ubuntu 22.04+ tested), Docker with NVIDIA Container Toolkit, and NVIDIA driver 580.x+. Detailed hardware notes are in docs/HARDWARE.md.

Highlighted Details

  • Achieve up to 127 TPS with vLLM on dual RTX 3090s, supporting vision, tools, and MTP.
  • Run full 262K context on a single RTX 3090 using llama.cpp, with stress-tested stability for agents.
  • Deploy an OpenAI-compatible API on localhost:8020 via validated Docker Compose.
  • Model-agnostic design allows easy integration of new LLMs.

Maintenance & Community

The project acknowledges contributions from key individuals and projects like vLLM and llama.cpp. It consolidates previous efforts into a single repository, encouraging new issues here. Community feedback from Reddit/X is noted, but direct community links are absent.

Licensing & Compatibility

Licensed under Apache 2.0, permitting broad usage, modification, and distribution, including for commercial purposes.

Limitations & Caveats

Focuses on RTX 3090 (24 GB); smaller GPUs are insufficient for 27B models. vLLM requires Linux/CUDA; llama.cpp recipes assume Linux paths. SGLang engine is currently blocked.

Health Check
Last Commit

19 hours ago

Responsiveness

Inactive

Pull Requests (30d)
86
Issues (30d)
96
Star History
1,158 stars in the last 29 days

Explore Similar Projects

Feedback? Help us improve.