llm-server-docs by varunvasudeva1

Docs for local LLM server setup on Debian

created 1 year ago
511 stars

Top 62.0% on sourcepulse

View on GitHub
Project Summary

This repository provides a comprehensive guide for setting up a fully local and private language model server on Debian. It targets Linux beginners and enthusiasts looking to integrate LLM inference, chat interfaces, text-to-speech, and text-to-image generation into a single, cohesive system. The primary benefit is achieving a cloud-like experience for AI applications without relying on external services, ensuring data privacy and control.

How It Works

The setup involves installing and configuring multiple components: inference engines (Ollama, llama.cpp, vLLM), a chat platform (Open WebUI), a text-to-speech server (OpenedAI Speech or Kokoro FastAPI), and a text-to-image server (ComfyUI). The guide emphasizes Debian as the base OS, detailing driver installation (Nvidia/AMD), power management for GPUs, auto-login, and service management via systemd or Docker. It offers choices between inference backends based on user needs for control, model format support, and features like vision capabilities.
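
One reason these pieces compose cleanly is that Ollama, llama.cpp's server, and vLLM can all expose an OpenAI-compatible HTTP API, which is how Open WebUI (and your own scripts) talk to whichever backend you picked. The sketch below is a minimal, hypothetical client call: the base URL assumes Ollama's default port (11434), and the model name is a placeholder for a model you have actually pulled.

```python
import requests

# Minimal sketch of a chat request against an OpenAI-compatible endpoint.
# The URL and model name are assumptions: Ollama's default port is 11434,
# while llama.cpp's server and vLLM are commonly served on other ports.
BASE_URL = "http://localhost:11434/v1"   # adjust to your backend
MODEL = "llama3.1:8b"                    # placeholder; use a model you have pulled

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "Say hello from my local server."}],
        "max_tokens": 64,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```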

Quick Start & Requirements

  • Installation: Primarily uses apt for system packages and Docker for most applications. Inference engines like llama.cpp and vLLM require manual compilation or pip installation.
  • Prerequisites: Debian Linux, an internet connection, basic Linux terminal knowledge, and a monitor/keyboard/mouse for initial setup.
  • Hardware: Any modern CPU/GPU combination works; an Nvidia RTX 3090 (24 GB VRAM) is the reference configuration. AMD GPU support is noted, and CPU-only inference is possible.
  • Dependencies: Docker, the HuggingFace CLI (for llama.cpp/vLLM model downloads; a download sketch follows this list), and Python virtual environments.
  • Resources: Requires significant disk space for models and sufficient RAM/VRAM for LLM inference.
  • Links: Debian, Docker, Ollama, llama.cpp, vLLM, Open WebUI, OpenedAI Speech, Kokoro FastAPI, ComfyUI.
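
For llama.cpp and vLLM, model weights come from Hugging Face rather than Ollama's registry. The sketch below is a hypothetical download using the huggingface_hub Python library, the programmatic counterpart of the HuggingFace CLI mentioned above; the repo_id and filename are placeholders for whatever model you actually want.

```python
from huggingface_hub import hf_hub_download

# Hypothetical download of a single GGUF file for llama.cpp.
# repo_id and filename are placeholders; substitute your chosen model.
model_path = hf_hub_download(
    repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",  # placeholder repository
    filename="Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",    # placeholder quantization
    local_dir="models",                                    # target directory
)
print(f"Model saved to: {model_path}")
```

The huggingface-cli tool referenced in the dependency list covers the same step from the terminal.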

Highlighted Details

  • Comprehensive integration of LLM inference, chat, TTS, and image generation (a TTS request sketch follows this list).
  • Detailed setup for Nvidia and AMD GPUs, including driver installation and CUDA configuration.
  • Choice of inference engines: Ollama (ease of use), llama.cpp (control), vLLM (advanced features, non-GGUF models).
  • Remote access via SSH and Tailscale for headless operation.
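
Both TTS options are presented as OpenAI-compatible speech servers, so a client written against OpenAI's /v1/audio/speech shape should work with either. The sketch below is a hypothetical request; the port, model, and voice values are assumptions that depend on which server you deployed and how it is configured.

```python
import requests

# Hypothetical text-to-speech request against an OpenAI-compatible endpoint
# (OpenedAI Speech or Kokoro FastAPI). Host, port, model, and voice are
# assumptions; check your server's configuration for the actual values.
TTS_URL = "http://localhost:8000/v1/audio/speech"  # placeholder host/port

resp = requests.post(
    TTS_URL,
    json={
        "model": "tts-1",                                  # placeholder model
        "input": "The local server is up and running.",
        "voice": "alloy",                                  # placeholder voice
    },
    timeout=120,
)
resp.raise_for_status()

with open("speech.mp3", "wb") as f:
    f.write(resp.content)  # the endpoint returns raw audio bytes
```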

Maintenance & Community

The repository is maintained by varunvasudeva1. It references community projects and encourages contributions and stars. Updates are provided for core components like Ollama, Open WebUI, and inference engines.

Licensing & Compatibility

The repository itself does not specify a license, but it guides the setup of projects with various open-source licenses (MIT, Apache 2.0, etc.). Compatibility for commercial use depends on the licenses of the individual components used.

Limitations & Caveats

The guide is tailored to Debian and may require adjustments for other Linux distributions. It assumes some comfort with the Linux terminal, though it aims to be beginner-friendly. Some steps, such as GPU driver installation and CUDA path configuration, can be complex; a quick environment check is sketched below. The author notes this is their first server setup, so some approaches may have better alternatives.
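
As a quick way to spot the most common CUDA path problems, the following sketch inspects the current environment. It is a generic check, not part of the guide, and changes nothing on the system.

```python
import os
import shutil
import subprocess

# Generic sanity check for GPU/CUDA visibility (not from the guide itself).
# It reports relevant environment variables and whether the Nvidia tools
# are on PATH; it does not modify anything.
for var in ("CUDA_HOME", "PATH", "LD_LIBRARY_PATH"):
    print(f"{var} = {os.environ.get(var, '<not set>')}")

for tool in ("nvidia-smi", "nvcc"):
    print(f"{tool}: {shutil.which(tool) or 'not found on PATH'}")

if shutil.which("nvidia-smi"):
    # Show the driver/GPU summary if the Nvidia driver is installed.
    subprocess.run(["nvidia-smi"], check=False)
```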

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 73 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 5 more.

TensorRT-LLM by NVIDIA

11k stars · top 0.6% on sourcepulse
LLM inference optimization SDK for NVIDIA GPUs
created 1 year ago · updated 18 hours ago

Starred by Andrej Karpathy (founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Nat Friedman (former CEO of GitHub), and 32 more.

llama.cpp by ggml-org

84k stars · top 0.4% on sourcepulse
C/C++ library for local LLM inference
created 2 years ago · updated 14 hours ago