pplx-garden by perplexityai

High-performance LLM inference engine

Created 3 weeks ago

274 stars

Top 94.4% on SourcePulse

Project Summary

Perplexity AI's pplx-garden is an open-source toolkit for high-performance LLM inference, built around RDMA point-to-point communication for Mixture-of-Experts (MoE) systems. Its P2P MoE dispatch and combine kernels let researchers and engineers optimize inter-node communication, reducing latency and improving throughput for large-scale LLM inference deployments.

How It Works

The core of pplx-garden is its RDMA TransferEngine library, designed for efficient inter-node data transfer in LLM systems. On top of it, the project implements P2P MoE dispatch/combine kernels that are optimized for decode while also supporting prefill. Intra-node traffic travels over NVLink and inter-node traffic over RDMA, with support for NVIDIA ConnectX-7 and AWS EFA NICs. A key design choice is splitting the send and receive stages, which enables micro-batching and makes RDMA transfers SM-free, meaning no GPU streaming multiprocessors are occupied by communication.
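
To make the split-stage design concrete, here is a minimal Python sketch of how a caller might drive separate dispatch and combine stages. Every name in it (the pplx_garden module path, the TransferEngine constructor arguments, and the dispatch_send/dispatch_recv/combine_send/combine_recv methods) is an assumption for illustration, not the library's confirmed API.

```python
# Illustrative sketch only -- assumed API names, not confirmed by the README.
import torch
from pplx_garden import TransferEngine  # hypothetical import path

# One engine per rank; assumes one dedicated RDMA NIC per GPU.
engine = TransferEngine(rank=0, world_size=16)

tokens = torch.randn(128, 7168, device="cuda")               # per-rank activations
expert_ids = torch.randint(0, 256, (128, 8), device="cuda")  # top-k routing choices

# Because send and receive are separate stages, the send for micro-batch N+1
# can be posted while micro-batch N is still in flight (micro-batching), and
# the transfer itself occupies no SMs (SM-free RDMA).
send_handle = engine.dispatch_send(tokens, expert_ids)
expert_inputs = engine.dispatch_recv(send_handle)  # tokens grouped per local expert

expert_outputs = run_local_experts(expert_inputs)  # placeholder for local MoE compute

# Combine mirrors dispatch: expert outputs travel back to their source ranks.
combine_handle = engine.combine_send(expert_outputs)
output = engine.combine_recv(combine_handle)
```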

Quick Start & Requirements

  • Installation: Development happens inside a Docker image: build it with docker build -t pplx-garden-dev - < docker/dev.Dockerfile, then enter it via ./scripts/run-docker.sh. With TORCH_CMAKE_PREFIX_PATH set, Python wheels can be built with python3 -m build --wheel and installed with python3 -m pip install /app/dist/*.whl (the full sequence is consolidated in the sketch after this list).
  • Prerequisites: Linux Kernel 5.12+, CUDA 12.8+, libfabric, libibverbs, GDRCopy. Requires an RDMA network with GPUDirect RDMA support, where each GPU has at least one dedicated RDMA NIC. SYS_PTRACE and SYS_ADMIN capabilities are also needed.
  • Links: The project cites an arXiv paper: https://arxiv.org/abs/2510.27656.
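
Put together, the documented commands form the sequence below. The docker, build, and install commands are verbatim from the README excerpt; deriving TORCH_CMAKE_PREFIX_PATH from torch itself is an assumed convenience, not something the excerpt specifies.

```sh
# Build the development image and start a container (from the README excerpt).
docker build -t pplx-garden-dev - < docker/dev.Dockerfile
./scripts/run-docker.sh

# Inside the container: point CMake at the installed torch. Deriving the path
# from torch itself is an assumed convention, not a documented requirement.
export TORCH_CMAKE_PREFIX_PATH="$(python3 -c 'import torch; print(torch.utils.cmake_prefix_path)')"

# Build the wheel and install it.
python3 -m build --wheel
python3 -m pip install /app/dist/*.whl
```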

Highlighted Details

  • Performance: Benchmarks show competitive or superior performance against DeepEP-CX7 for both decode (e.g., 110.2 μs dispatch/combine for pplx-CX7 at EP16) and prefill (e.g., 2481.9 μs dispatch/combine for DeepEP-CX7 at EP16), across various configurations and NICs (EFA, CX7).
  • Hardware Flexibility: Supports NVIDIA ConnectX-7 and AWS EFA NICs, with a design that can extend to other RDMA NICs; multiple NICs can be aggregated per GPU.
  • Optimizations: Features SM-free RDMA transfers, CUDA Graph support within the TransferEngine, and optimized P2P MoE dispatch/combine kernels (see the capture sketch below).
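
The CUDA Graph support suggests the communication ops can be captured once and replayed at fixed cost each decode step. Here is a hedged sketch of what that could look like with PyTorch's graph-capture API, reusing the hypothetical engine names from the earlier sketch; it assumes the engine's ops are capture-safe, which the README implies but this example does not confirm.

```python
# Hedged sketch: assumes engine.dispatch_send/dispatch_recv are capture-safe,
# as "CUDA Graph support" implies. Method names remain hypothetical.
import torch

graph = torch.cuda.CUDAGraph()

# Capture requires static buffers: the graph always replays on these tensors.
static_tokens = torch.randn(128, 7168, device="cuda")
static_ids = torch.randint(0, 256, (128, 8), device="cuda")

with torch.cuda.graph(graph):
    handle = engine.dispatch_send(static_tokens, static_ids)
    expert_inputs = engine.dispatch_recv(handle)

# Each decode step: refresh the static input, then replay the captured work
# with no per-op CPU launch overhead.
next_batch_tokens = torch.randn(128, 7168, device="cuda")  # stand-in for real activations
static_tokens.copy_(next_batch_tokens)
graph.replay()
```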

Maintenance & Community

No specific details regarding maintainers, community channels (e.g., Discord, Slack), or roadmap were provided in the README excerpt.

Licensing & Compatibility

  • License: The license type is not explicitly stated in the provided README.
  • Compatibility: Requires specific, high-performance networking hardware (RDMA, GPUDirect RDMA) and recent CUDA/Linux kernel versions, indicating compatibility is limited to specialized environments.

Limitations & Caveats

Adoption requires specialized infrastructure: RDMA-capable NICs with GPUDirect RDMA support (at least one dedicated NIC per GPU), plus a recent Linux kernel (5.12+) and CUDA (12.8+). Users without such a setup face a significant barrier to entry.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 3

Star History
277 stars in the last 26 days

Explore Similar Projects

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 8 more.

lorax by predibase

Top 0.4% on SourcePulse · 4k stars
Multi-LoRA inference server for serving 1000s of fine-tuned LLMs
Created 2 years ago · Updated 6 months ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Clement Delangue (Cofounder of Hugging Face), and 60 more.

vllm by vllm-project

Top 0.8% on SourcePulse · 64k stars
LLM serving engine for high-throughput, memory-efficient inference
Created 2 years ago · Updated 1 day ago