pplx-garden by perplexityai

High-performance LLM inference engine

Created 3 weeks ago

274 stars

Top 94.4% on SourcePulse

Project Summary

Perplexity AI's pplx-garden is an open-source toolkit for high-performance LLM inference, built around RDMA point-to-point communication for Mixture-of-Experts (MoE) systems. Its P2P MoE dispatch and combine kernels let researchers and engineers optimize inter-node communication, reducing latency and improving throughput for large-scale LLM inference deployments.

How It Works

The core of pplx-garden is its RDMA TransferEngine library, designed for efficient inter-node data transfer in LLM systems. On top of it, the project implements P2P MoE dispatch/combine kernels that are optimized for decode while also supporting prefill. Intra-node traffic travels over NVLink and inter-node traffic over RDMA, with support for NVIDIA ConnectX-7 and AWS EFA NICs. A key design choice is splitting the send and receive stages, which enables micro-batching and makes RDMA transfers SM-free, meaning no GPU streaming multiprocessors are occupied by communication.
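
To make the split-stage design concrete, here is a minimal Python sketch of how a caller might drive separate dispatch and combine stages. Every name in it (the pplx_garden module path, the TransferEngine constructor arguments, and the dispatch_send/dispatch_recv/combine_send/combine_recv methods) is an assumption for illustration, not the library's confirmed API.

```python
# Illustrative sketch only -- assumed API names, not confirmed by the README.
import torch
from pplx_garden import TransferEngine  # hypothetical import path

# One engine per rank; assumes one dedicated RDMA NIC per GPU.
engine = TransferEngine(rank=0, world_size=16)

tokens = torch.randn(128, 7168, device="cuda")               # per-rank activations
expert_ids = torch.randint(0, 256, (128, 8), device="cuda")  # top-k routing choices

# Because send and receive are separate stages, the send for micro-batch N+1
# can be posted while micro-batch N is still in flight (micro-batching), and
# the transfer itself occupies no SMs (SM-free RDMA).
send_handle = engine.dispatch_send(tokens, expert_ids)
expert_inputs = engine.dispatch_recv(send_handle)  # tokens grouped per local expert

expert_outputs = run_local_experts(expert_inputs)  # placeholder for local MoE compute

# Combine mirrors dispatch: expert outputs travel back to their source ranks.
combine_handle = engine.combine_send(expert_outputs)
output = engine.combine_recv(combine_handle)
```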

Quick Start & Requirements

  • Installation: Development happens inside a Docker image: build it with docker build -t pplx-garden-dev - < docker/dev.Dockerfile, then enter it via ./scripts/run-docker.sh. With TORCH_CMAKE_PREFIX_PATH set, Python wheels can be built with python3 -m build --wheel and installed with python3 -m pip install /app/dist/*.whl (the full sequence is consolidated in the sketch after this list).
  • Prerequisites: Linux Kernel 5.12+, CUDA 12.8+, libfabric, libibverbs, GDRCopy. Requires an RDMA network with GPUDirect RDMA support, where each GPU has at least one dedicated RDMA NIC. SYS_PTRACE and SYS_ADMIN capabilities are also needed.
  • Links: The project cites an arXiv paper: https://arxiv.org/abs/2510.27656.
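
Put together, the documented commands form the sequence below. The docker, build, and install commands are verbatim from the README excerpt; deriving TORCH_CMAKE_PREFIX_PATH from torch itself is an assumed convenience, not something the excerpt specifies.

```sh
# Build the development image and start a container (from the README excerpt).
docker build -t pplx-garden-dev - < docker/dev.Dockerfile
./scripts/run-docker.sh

# Inside the container: point CMake at the installed torch. Deriving the path
# from torch itself is an assumed convention, not a documented requirement.
export TORCH_CMAKE_PREFIX_PATH="$(python3 -c 'import torch; print(torch.utils.cmake_prefix_path)')"

# Build the wheel and install it.
python3 -m build --wheel
python3 -m pip install /app/dist/*.whl
```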

Highlighted Details

  • Performance: Benchmarks show competitive or superior performance against DeepEP-CX7 for both decode (e.g., 110.2 μs dispatch/combine for pplx-CX7 at EP16) and prefill (e.g., 2481.9 μs dispatch/combine for DeepEP-CX7 at EP16), across various configurations and NICs (EFA, CX7).
  • Hardware Flexibility: Supports NVIDIA ConnectX-7 and AWS EFA NICs, with a design that can extend to other RDMA NICs; multiple NICs can be aggregated per GPU.
  • Optimizations: Features SM-free RDMA transfers, CUDA Graph support within the TransferEngine, and optimized P2P MoE dispatch/combine kernels (see the capture sketch below).
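
The CUDA Graph support suggests the communication ops can be captured once and replayed at fixed cost each decode step. Here is a hedged sketch of what that could look like with PyTorch's graph-capture API, reusing the hypothetical engine names from the earlier sketch; it assumes the engine's ops are capture-safe, which the README implies but this example does not confirm.

```python
# Hedged sketch: assumes engine.dispatch_send/dispatch_recv are capture-safe,
# as "CUDA Graph support" implies. Method names remain hypothetical.
import torch

graph = torch.cuda.CUDAGraph()

# Capture requires static buffers: the graph always replays on these tensors.
static_tokens = torch.randn(128, 7168, device="cuda")
static_ids = torch.randint(0, 256, (128, 8), device="cuda")

with torch.cuda.graph(graph):
    handle = engine.dispatch_send(static_tokens, static_ids)
    expert_inputs = engine.dispatch_recv(handle)

# Each decode step: refresh the static input, then replay the captured work
# with no per-op CPU launch overhead.
next_batch_tokens = torch.randn(128, 7168, device="cuda")  # stand-in for real activations
static_tokens.copy_(next_batch_tokens)
graph.replay()
```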

Maintenance & Community

No specific details regarding maintainers, community channels (e.g., Discord, Slack), or roadmap were provided in the README excerpt.

Licensing & Compatibility

  • License: The license type is not explicitly stated in the provided README.
  • Compatibility: Requires specific, high-performance networking hardware (RDMA, GPUDirect RDMA) and recent CUDA/Linux kernel versions, indicating compatibility is limited to specialized environments.

Limitations & Caveats

Adoption requires specialized infrastructure: RDMA-capable NICs with GPUDirect RDMA support (at least one dedicated NIC per GPU), plus a recent Linux kernel (5.12+) and CUDA (12.8+). Users without such a setup face a significant barrier to entry.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 3

Star History
277 stars in the last 26 days

Explore Similar Projects

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 8 more.

lorax by predibase

Top 0.4% on SourcePulse · 4k stars
Multi-LoRA inference server for serving 1000s of fine-tuned LLMs
Created 2 years ago · Updated 6 months ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Clement Delangue (Cofounder of Hugging Face), and 60 more.

vllm by vllm-project

Top 0.8% on SourcePulse · 64k stars
LLM serving engine for high-throughput, memory-efficient inference
Created 2 years ago · Updated 1 day ago