spark-vllm-docker by eugr

Dockerized vLLM for high-performance multi-node inference

Created 3 months ago
390 stars

Top 73.8% on SourcePulse

Project Summary

This repository provides Docker configurations and startup scripts for deploying vLLM, a high-throughput LLM inference engine, on DGX Spark systems. It targets users needing to run large language models efficiently in multi-node or single-node setups, leveraging Ray for cluster management and InfiniBand/RDMA for high-performance communication. The primary benefit is enabling optimized, scalable LLM inference on specialized hardware.

How It Works

The project utilizes Docker to package vLLM and its dependencies, integrating with Ray for distributed execution across multiple DGX Spark nodes. It is specifically engineered to leverage InfiniBand/NCCL for low-latency, high-bandwidth inter-node communication, crucial for large-scale inference workloads. The approach prioritizes performance by building directly from the vLLM main branch and offering optimizations for DGX Spark's networking and hardware architecture.
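For intuition, distributed runs of this kind are typically steered with standard NCCL environment variables. The variable names below are real NCCL settings, but the values and the interface name are illustrative assumptions, not this repository's exact configuration:

```shell
# Standard NCCL knobs for steering InfiniBand/RDMA transport between
# nodes; the values and interface name are illustrative assumptions.
export NCCL_IB_DISABLE=0          # keep InfiniBand transport enabled
export NCCL_SOCKET_IFNAME=ib0     # assumed name of the IB interface
export NCCL_DEBUG=INFO            # log transport selection at startup

echo "NCCL configured for interface $NCCL_SOCKET_IFNAME"
```

In practice the repository's launch scripts would export settings like these inside the container so that Ray workers on each node pick them up.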

Quick Start & Requirements

  • Primary Install/Run:
    • Build Image: ./build-and-copy.sh (recommended for cluster deployment) or docker build -f Dockerfile.wheels -t vllm-node . (for wheels build).
    • Run (Single Node): ./launch-cluster.sh --solo exec vllm serve <model> ... or docker run ... vllm serve <model> ...
    • Run (Cluster): ./launch-cluster.sh exec vllm serve <model> ...
  • Prerequisites:
    • DGX Spark hardware (single or multi-node).
    • InfiniBand/RDMA support.
    • Passwordless SSH configured for multi-node setups.
    • Python 3.12.3 environment (as per build output).
    • CUDA architecture 12.1a targeted by default; configurable via --gpu-arch.
    • uvx for hf-download.sh script.
  • Links:
    • vLLM nightly wheels status: https://wheels.vllm.ai/nightly/cu130/vllm/
    • NVIDIA Connect Two Sparks Playbook (for SSH setup).
    • NVIDIA Networking Guide.
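The install and run steps above can be sketched as one shell flow. Script names and flags come from the summary; the model ID is a hypothetical placeholder:

```shell
# Sketch of the quick-start flow. MODEL is a hypothetical placeholder;
# build-and-copy.sh and launch-cluster.sh are the repo's documented scripts.
MODEL="${MODEL:-openai/gpt-oss-20b}"

# Build the image (and copy it to the peer node for cluster deployment):
#   ./build-and-copy.sh
# or, for a wheels-based build:
#   docker build -f Dockerfile.wheels -t vllm-node .

# Single node:
CMD_SOLO="./launch-cluster.sh --solo exec vllm serve $MODEL"
# Whole cluster (requires passwordless SSH between nodes):
CMD_CLUSTER="./launch-cluster.sh exec vllm serve $MODEL"

echo "$CMD_SOLO"
```

Swapping `--solo` in or out is the only difference between the single-node and cluster invocations.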

Highlighted Details

  • Optimized for multi-node vLLM inference clusters using Ray and InfiniBand/RDMA (NCCL).
  • Includes convenience scripts (build-and-copy.sh, launch-cluster.sh, hf-download.sh) for streamlined build, deployment, and model management.
  • Supports applying custom mods and patches for model-specific compatibility fixes and experimental features.
  • Experimental fastsafetensors support for accelerated model loading.
  • Experimental MXFP4 build offers high performance for specific models like GPT-OSS on DGX Spark.

Maintenance & Community

This project is a community effort, not officially affiliated with NVIDIA. It acknowledges contributions from individuals like @raphaelamorim and @ericlewis. Specific community channels or roadmaps are not detailed in the provided README excerpt.

Licensing & Compatibility

The specific open-source license for this repository is not explicitly stated in the provided README excerpt. Compatibility notes for commercial use or integration with closed-source projects are also not detailed.

Limitations & Caveats

The Dockerfile builds from the vLLM main branch, which may occasionally be in an unstable state. Wheel builds can hit platform-specific dependency issues. NVFP4 models on Spark are noted as having suboptimal performance and potential stability issues within vLLM. fastsafetensors in a cluster configuration is experimental. The default build targets CUDA architecture 12.1a; other GPU architectures require explicit configuration via --gpu-arch.
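A minimal sketch of overriding the default architecture, assuming the build script accepts --gpu-arch as the summary states (the variable and default shown are assumptions):

```shell
# Override the default CUDA architecture (12.1a) at build time.
# The --gpu-arch flag is documented in the summary; GPU_ARCH is an
# assumed convention for passing an alternate value.
GPU_ARCH="${GPU_ARCH:-12.1a}"
echo "./build-and-copy.sh --gpu-arch $GPU_ARCH"
```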

Health Check
Last Commit

22 hours ago

Responsiveness

Inactive

Pull Requests (30d)
10
Issues (30d)
35
Star History
250 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Luis Capelo (Cofounder of Lightning AI), and 3 more.

LitServe by Lightning-AI

0.1%
4k
AI inference pipeline framework
Created 2 years ago
Updated 22 hours ago
Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Chaoyu Yang (Founder of Bento), and 3 more.

llm-d by llm-d

0.7%
3k
Kubernetes-native framework for distributed LLM inference
Created 10 months ago
Updated 1 day ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 3 more.

serve by pytorch

0%
4k
Serve, optimize, and scale PyTorch models in production
Created 6 years ago
Updated 6 months ago