GPU-driven communication stack for scalable AI applications
MSCCL++ is a GPU-driven communication stack designed to enhance the efficiency and customizability of distributed AI applications. It offers a flexible, multi-layer abstraction for inter-GPU communication, targeting researchers and engineers working with large-scale AI models, particularly for LLM inference. The primary benefit is improved performance and reduced complexity in managing GPU-to-GPU data movement.
How It Works
MSCCL++ provides ultra-lightweight, on-GPU communication interfaces called "Channels" that can be called directly from CUDA kernels. These channels abstract peer-to-peer communication and expose data-movement and synchronization primitives such as put(), get(), signal(), flush(), and wait(). Both zero-copy synchronous and asynchronous operations are supported, enabling communication-to-computation overlap and custom collective algorithms without deadlocks. MSCCL++ unifies these abstractions across hardware interconnects (NVLink, InfiniBand) and GPU locations (local or remote nodes).
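A minimal device-side sketch of this pattern is shown below, assuming a memory-mapped channel whose handle has already been set up by the host. The memory_channel_device.hpp header path, the DeviceHandle&lt;MemoryChannel&gt; type, and the exact put() signature are modeled on the project's documented channel interface and may differ across releases; host-side connection setup is omitted.

```cuda
// Sketch only: header path, handle type, and put() signature are assumptions
// based on MSCCL++'s memory-channel interface; verify against your installed headers.
#include <mscclpp/memory_channel_device.hpp>

// The host builds the channel after connecting two GPUs and copies its
// device handle into this symbol (host setup not shown).
__constant__ mscclpp::DeviceHandle<mscclpp::MemoryChannel> channel;

__global__ void exchange(size_t dstOffset, size_t srcOffset, size_t bytes) {
  // put(): threads of this block cooperatively copy `bytes` from the local
  // buffer at srcOffset into the peer's buffer at dstOffset (zero-copy, memory-mapped).
  channel.put(dstOffset, srcOffset, bytes, threadIdx.x, blockDim.x);
  __syncthreads();
  if (threadIdx.x == 0) {
    channel.signal();  // tell the peer that our data has landed
    channel.wait();    // block until the peer signals that its data is ready
  }
}
```

A matching kernel on the peer GPU would complete the symmetric exchange; since the copy is performed by GPU threads over mapped memory, spreading put() across the block trades occupancy for copy bandwidth.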
Quick Start & Requirements
Highlighted Details
Two channel types: PortChannel (port-mapped; driven by a single GPU thread through a host-side proxy) and MemoryChannel (memory-mapped; accessed directly by GPU threads, optimized for low latency). A sketch of the proxy-based path follows below.
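The sketch below illustrates the PortChannel path, where one GPU thread posts requests that a host-side proxy executes. The port_channel_device.hpp header path, the DeviceHandle&lt;PortChannel&gt; type, and the proxySend kernel name are assumptions; proxy-service and connection setup on the host are omitted.

```cuda
// Sketch only: header path and handle type are assumptions based on MSCCL++'s
// port/proxy channel interface; verify against your installed headers.
#include <mscclpp/port_channel_device.hpp>

// Handle copied in by the host after the proxy service and connection are set up.
__constant__ mscclpp::DeviceHandle<mscclpp::PortChannel> portChannel;

__global__ void proxySend(size_t dstOffset, size_t srcOffset, size_t bytes) {
  // A single GPU thread is enough: each call only posts a request to the
  // host-side proxy, which performs the transfer over NVLink or InfiniBand.
  if (threadIdx.x == 0 && blockIdx.x == 0) {
    portChannel.put(dstOffset, srcOffset, bytes);  // enqueue the data transfer
    portChannel.signal();                          // enqueue a signal to the remote side
    portChannel.flush();                           // wait until the proxy has drained both requests
  }
}
```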
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats