amd-strix-halo-vllm-toolboxes by kyuz0

LLM serving container for AMD Strix Halo hardware

Created 7 months ago
269 stars

Top 95.4% on SourcePulse

View on GitHub
Project Summary

This repository provides optimized Docker/Podman containers and Toolbx integration for serving Large Language Models (LLMs) using vLLM on AMD Ryzen AI Max “Strix Halo” (gfx1151) hardware. It targets engineers and researchers seeking high-performance LLM inference, offering advanced features like RDMA-based distributed clustering for unified memory expansion across multiple nodes. The project enables efficient LLM deployment on specialized AMD hardware, leveraging ROCm nightly builds.

How It Works

The core approach utilizes Fedora 43-based Toolbx-compatible containers built upon TheRock nightly ROCm builds. A key innovation is a custom ROCm/RCCL patch enabling native RDMA/RoCE v2 support, facilitating low-latency, high-bandwidth communication between nodes. This allows for Tensor Parallelism across multiple Strix Halo devices, effectively pooling their memory for larger model deployments. vLLM serves as the inference engine, optimized within this containerized environment.
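Once inside the container with ROCm visible, serving a model reduces to the standard vLLM CLI. A minimal single-node sketch; the model name, context length, and port are illustrative assumptions rather than values taken from the project:

```shell
# Launch vLLM's OpenAI-compatible server on one Strix Halo node.
# Model and limits are illustrative; no parallelism flags are needed
# for a single node.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 32768 \
  --port 8000
```

The served endpoint then accepts standard OpenAI-style requests at `http://localhost:8000/v1`.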

Quick Start & Requirements

  • Installation: Use refresh_toolbox.sh for Fedora Toolbx or distrobox create for Ubuntu.
  • Prerequisites: AMD Strix Halo (gfx1151) GPU, Fedora 43 (or Ubuntu with Distrobox), ROCm (via TheRock nightly builds). Specific kernel parameters (iommu=pt, amdgpu.gttsize=126976, ttm.pages_limit=32505856) may be required for optimal unified memory configuration.
  • Resources: Requires AMD Strix Halo hardware, substantial system RAM (e.g., 128 GB tested), and BIOS GPU memory allocation.
  • Documentation: Project overview, tutorials, and host configuration available at https://kyuz0.github.io/amd-strix-halo-vllm-toolboxes/.
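The kernel parameters listed above can be applied as a one-time boot-configuration change on Fedora. The `grubby` invocation below is standard, but whether all three parameters are needed depends on your BIOS memory split, so treat this as a sketch and consult the project's host-configuration docs:

```shell
# Append the unified-memory kernel parameters to every installed kernel
# entry (Fedora's grubby tool); a reboot is required for them to take effect.
sudo grubby --update-kernel=ALL \
  --args="iommu=pt amdgpu.gttsize=126976 ttm.pages_limit=32505856"

# Verify after reboot:
cat /proc/cmdline
```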

Highlighted Details

  • RDMA/RoCE Clustering: Native support for high-performance, low-latency distributed LLM inference across multiple Strix Halo nodes using RDMA/RoCE v2, enabling Tensor Parallelism (TP=2) with pooled memory.
  • Extended Context Lengths: Benchmarks demonstrate support for large context windows (up to 256k tokens) with various models like Llama 3.1, Gemma 3, and Qwen3, achieving high GPU utilization.
  • Flexible Containerization: Supports both Fedora Toolbx for development (shared HOME directory) and Docker/Podman for deployment (service isolation).
  • TUI Wizard: Includes a start-vllm TUI wizard for simplified model serving setup.
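The TP=2 clustering highlighted above would follow vLLM's usual multi-node pattern: a Ray cluster spanning both machines, with the project's RCCL patch carrying inter-node traffic over RDMA/RoCE v2. A hedged sketch; the head address and model are placeholders, and the project's start-vllm wizard may automate a different launch flow:

```shell
# Node 1 (head): start the Ray cluster.
ray start --head --port=6379

# Node 2 (worker): join the cluster (replace <head-ip> with node 1's address).
ray start --address=<head-ip>:6379

# Back on the head node: tensor-parallel size 2 shards the model across
# both nodes, pooling their unified memory for larger models.
vllm serve Qwen/Qwen3-30B-A3B --tensor-parallel-size 2
```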

Maintenance & Community

This project is maintained as a hobby in the author's spare time; funding is via voluntary contributions ("buy me a coffee"). No community channels (such as Discord or Slack) are listed.

Licensing & Compatibility

The license type is not explicitly stated in the provided README content. Compatibility is specific to AMD Strix Halo (gfx1151) hardware.

Limitations & Caveats

Vision model support is currently unavailable due to a patch disabling vision encoder profiling to prevent indefinite hangs during MIOpen kernel searches. The project primarily targets Fedora 43, with Ubuntu support relying on Distrobox.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 6
  • Star History: 59 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Johannes Hagemann (Cofounder of Prime Intellect), and 4 more.

S-LoRA by S-LoRA

0.2% on SourcePulse · 2k stars
System for scalable LoRA adapter serving
Created 2 years ago · Updated 2 years ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

rtp-llm by alibaba

0.3% on SourcePulse · 1k stars
LLM inference engine for diverse applications
Created 2 years ago · Updated 22 hours ago