amd-strix-halo-vllm-toolboxes by kyuz0

LLM serving container for AMD Strix Halo hardware

Created 7 months ago
269 stars

Top 95.4% on SourcePulse

View on GitHub
Project Summary

This repository provides optimized Docker/Podman containers and Toolbx integration for serving Large Language Models (LLMs) using vLLM on AMD Ryzen AI Max “Strix Halo” (gfx1151) hardware. It targets engineers and researchers seeking high-performance LLM inference, offering advanced features like RDMA-based distributed clustering for unified memory expansion across multiple nodes. The project enables efficient LLM deployment on specialized AMD hardware, leveraging ROCm nightly builds.

How It Works

The core approach utilizes Fedora 43-based Toolbx-compatible containers built upon TheRock nightly ROCm builds. A key innovation is a custom ROCm/RCCL patch enabling native RDMA/RoCE v2 support, facilitating low-latency, high-bandwidth communication between nodes. This allows for Tensor Parallelism across multiple Strix Halo devices, effectively pooling their memory for larger model deployments. vLLM serves as the inference engine, optimized within this containerized environment.
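Once inside the container with ROCm visible, serving a model reduces to the standard vLLM CLI. A minimal single-node sketch; the model name, context length, and port are illustrative assumptions rather than values taken from the project:

```shell
# Launch vLLM's OpenAI-compatible server on one Strix Halo node.
# Model and limits are illustrative; no parallelism flags are needed
# for a single node.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 32768 \
  --port 8000
```

The served endpoint then accepts standard OpenAI-style requests at `http://localhost:8000/v1`.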

Quick Start & Requirements

  • Installation: Use refresh_toolbox.sh for Fedora Toolbx or distrobox create for Ubuntu.
  • Prerequisites: AMD Strix Halo (gfx1151) GPU, Fedora 43 (or Ubuntu with Distrobox), ROCm (via TheRock nightly builds). Specific kernel parameters (iommu=pt, amdgpu.gttsize=126976, ttm.pages_limit=32505856) may be required for optimal unified memory configuration.
  • Resources: Requires AMD Strix Halo hardware, substantial system RAM (e.g., 128 GB tested), and BIOS GPU memory allocation.
  • Documentation: Project overview, tutorials, and host configuration available at https://kyuz0.github.io/amd-strix-halo-vllm-toolboxes/.
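The kernel parameters listed above can be applied as a one-time boot-configuration change on Fedora. The `grubby` invocation below is standard, but whether all three parameters are needed depends on your BIOS memory split, so treat this as a sketch and consult the project's host-configuration docs:

```shell
# Append the unified-memory kernel parameters to every installed kernel
# entry (Fedora's grubby tool); a reboot is required for them to take effect.
sudo grubby --update-kernel=ALL \
  --args="iommu=pt amdgpu.gttsize=126976 ttm.pages_limit=32505856"

# Verify after reboot:
cat /proc/cmdline
```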

Highlighted Details

  • RDMA/RoCE Clustering: Native support for high-performance, low-latency distributed LLM inference across multiple Strix Halo nodes using RDMA/RoCE v2, enabling Tensor Parallelism (TP=2) with pooled memory.
  • Extended Context Lengths: Benchmarks demonstrate support for large context windows (up to 256k tokens) with various models like Llama 3.1, Gemma 3, and Qwen3, achieving high GPU utilization.
  • Flexible Containerization: Supports both Fedora Toolbx for development (shared HOME directory) and Docker/Podman for deployment (service isolation).
  • TUI Wizard: Includes a start-vllm TUI wizard for simplified model serving setup.
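The TP=2 clustering highlighted above would follow vLLM's usual multi-node pattern: a Ray cluster spanning both machines, with the project's RCCL patch carrying inter-node traffic over RDMA/RoCE v2. A hedged sketch; the head address and model are placeholders, and the project's start-vllm wizard may automate a different launch flow:

```shell
# Node 1 (head): start the Ray cluster.
ray start --head --port=6379

# Node 2 (worker): join the cluster (replace <head-ip> with node 1's address).
ray start --address=<head-ip>:6379

# Back on the head node: tensor-parallel size 2 shards the model across
# both nodes, pooling their unified memory for larger models.
vllm serve Qwen/Qwen3-30B-A3B --tensor-parallel-size 2
```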

Maintenance & Community

This project is maintained as a hobby in the author's spare time; funding is via voluntary contributions ("buy me a coffee"). No community channels (such as Discord or Slack) are listed.

Licensing & Compatibility

The license type is not explicitly stated in the provided README content. Compatibility is specific to AMD Strix Halo (gfx1151) hardware.

Limitations & Caveats

Vision model support is currently unavailable due to a patch disabling vision encoder profiling to prevent indefinite hangs during MIOpen kernel searches. The project primarily targets Fedora 43, with Ubuntu support relying on Distrobox.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 6
  • Star History: 59 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Johannes Hagemann (Cofounder of Prime Intellect), and 4 more.

S-LoRA by S-LoRA

0.2% on SourcePulse · 2k stars
System for scalable LoRA adapter serving
Created 2 years ago · Updated 2 years ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

rtp-llm by alibaba

0.3% on SourcePulse · 1k stars
LLM inference engine for diverse applications
Created 2 years ago · Updated 22 hours ago