spark-vllm-docker by eugr

Dockerized vLLM for high-performance multi-node inference

Created 3 months ago
390 stars

Top 73.8% on SourcePulse

Project Summary

This repository provides Docker configurations and startup scripts for deploying vLLM, a high-throughput LLM inference engine, on DGX Spark systems. It targets users needing to run large language models efficiently in multi-node or single-node setups, leveraging Ray for cluster management and InfiniBand/RDMA for high-performance communication. The primary benefit is enabling optimized, scalable LLM inference on specialized hardware.

How It Works

The project utilizes Docker to package vLLM and its dependencies, integrating with Ray for distributed execution across multiple DGX Spark nodes. It is specifically engineered to leverage InfiniBand/NCCL for low-latency, high-bandwidth inter-node communication, crucial for large-scale inference workloads. The approach prioritizes performance by building directly from the vLLM main branch and offering optimizations for DGX Spark's networking and hardware architecture.
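For intuition, distributed runs of this kind are typically steered with standard NCCL environment variables. The variable names below are real NCCL settings, but the values and the interface name are illustrative assumptions, not this repository's exact configuration:

```shell
# Standard NCCL knobs for steering InfiniBand/RDMA transport between
# nodes; the values and interface name are illustrative assumptions.
export NCCL_IB_DISABLE=0          # keep InfiniBand transport enabled
export NCCL_SOCKET_IFNAME=ib0     # assumed name of the IB interface
export NCCL_DEBUG=INFO            # log transport selection at startup

echo "NCCL configured for interface $NCCL_SOCKET_IFNAME"
```

In practice the repository's launch scripts would export settings like these inside the container so that Ray workers on each node pick them up.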

Quick Start & Requirements

  • Primary Install/Run:
    • Build Image: ./build-and-copy.sh (recommended for cluster deployment) or docker build -f Dockerfile.wheels -t vllm-node . (for wheels build).
    • Run (Single Node): ./launch-cluster.sh --solo exec vllm serve <model> ... or docker run ... vllm serve <model> ...
    • Run (Cluster): ./launch-cluster.sh exec vllm serve <model> ...
  • Prerequisites:
    • DGX Spark hardware (single or multi-node).
    • InfiniBand/RDMA support.
    • Passwordless SSH configured for multi-node setups.
    • Python 3.12.3 environment (as per build output).
    • CUDA architecture 12.1a targeted by default; configurable via --gpu-arch.
    • uvx for hf-download.sh script.
  • Links:
    • vLLM nightly wheels status: https://wheels.vllm.ai/nightly/cu130/vllm/
    • NVIDIA Connect Two Sparks Playbook (for SSH setup).
    • NVIDIA Networking Guide.
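The install and run steps above can be sketched as one shell flow. Script names and flags come from the summary; the model ID is a hypothetical placeholder:

```shell
# Sketch of the quick-start flow. MODEL is a hypothetical placeholder;
# build-and-copy.sh and launch-cluster.sh are the repo's documented scripts.
MODEL="${MODEL:-openai/gpt-oss-20b}"

# Build the image (and copy it to the peer node for cluster deployment):
#   ./build-and-copy.sh
# or, for a wheels-based build:
#   docker build -f Dockerfile.wheels -t vllm-node .

# Single node:
CMD_SOLO="./launch-cluster.sh --solo exec vllm serve $MODEL"
# Whole cluster (requires passwordless SSH between nodes):
CMD_CLUSTER="./launch-cluster.sh exec vllm serve $MODEL"

echo "$CMD_SOLO"
```

Swapping `--solo` in or out is the only difference between the single-node and cluster invocations.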

Highlighted Details

  • Optimized for multi-node vLLM inference clusters using Ray and InfiniBand/RDMA (NCCL).
  • Includes convenience scripts (build-and-copy.sh, launch-cluster.sh, hf-download.sh) for streamlined build, deployment, and model management.
  • Supports applying custom mods and patches for model-specific compatibility fixes and experimental features.
  • Experimental fastsafetensors support for accelerated model loading.
  • Experimental MXFP4 build offers high performance for specific models like GPT-OSS on DGX Spark.

Maintenance & Community

This project is a community effort, not officially affiliated with NVIDIA. It acknowledges contributions from individuals like @raphaelamorim and @ericlewis. Specific community channels or roadmaps are not detailed in the provided README excerpt.

Licensing & Compatibility

The specific open-source license for this repository is not explicitly stated in the provided README excerpt. Compatibility notes for commercial use or integration with closed-source projects are also not detailed.

Limitations & Caveats

The Dockerfile builds from the vLLM main branch, which may occasionally be in an unstable state. Wheel builds can hit platform-specific dependency issues. NVFP4 models on Spark are noted as having suboptimal performance and potential stability issues within vLLM. fastsafetensors in a cluster configuration is experimental. The default build targets CUDA architecture 12.1a; other GPU architectures require explicit configuration via --gpu-arch.
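A minimal sketch of overriding the default architecture, assuming the build script accepts --gpu-arch as the summary states (the variable and default shown are assumptions):

```shell
# Override the default CUDA architecture (12.1a) at build time.
# The --gpu-arch flag is documented in the summary; GPU_ARCH is an
# assumed convention for passing an alternate value.
GPU_ARCH="${GPU_ARCH:-12.1a}"
echo "./build-and-copy.sh --gpu-arch $GPU_ARCH"
```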

Health Check
Last Commit

22 hours ago

Responsiveness

Inactive

Pull Requests (30d)
10
Issues (30d)
35
Star History
250 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Luis Capelo (Cofounder of Lightning AI), and 3 more.

LitServe by Lightning-AI

0.1%
4k
AI inference pipeline framework
Created 2 years ago
Updated 22 hours ago
Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Chaoyu Yang (Founder of Bento), and 3 more.

llm-d by llm-d

0.7%
3k
Kubernetes-native framework for distributed LLM inference
Created 10 months ago
Updated 1 day ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 3 more.

serve by pytorch

0%
4k
Serve, optimize, and scale PyTorch models in production
Created 6 years ago
Updated 6 months ago