mscclpp by microsoft

GPU-driven communication stack for scalable AI applications

created 2 years ago
392 stars

Top 74.5% on sourcepulse

View on GitHub
Project Summary

MSCCL++ is a GPU-driven communication stack designed to enhance the efficiency and customizability of distributed AI applications. It offers a flexible, multi-layer abstraction for inter-GPU communication, targeting researchers and engineers working with large-scale AI models, particularly for LLM inference. The primary benefit is improved performance and reduced complexity in managing GPU-to-GPU data movement.

How It Works

MSCCL++ provides ultra-lightweight, on-GPU communication interfaces called "Channels" that are invoked directly from CUDA kernels. Each channel abstracts peer-to-peer communication and exposes data-movement and synchronization primitives such as put(), get(), signal(), flush(), and wait(). It supports both 0-copy synchronous and asynchronous operations, enabling communication-to-computation overlap and custom collective algorithms without deadlocks. MSCCL++ unifies these abstractions across hardware interconnects (NVLink, InfiniBand) and GPU locations (same node or remote nodes).
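
As a rough illustration, the sketch below shows how these primitives might be used from inside a CUDA kernel: a block-wide put() of a buffer into the peer's memory, followed by a signal()/wait() handshake. The handle type (DeviceChannel) and the method signatures are illustrative assumptions, not the verbatim MSCCL++ device API; consult the project documentation for the real interface.

  // Illustrative sketch only: the handle type and signatures below are
  // assumptions, not the verbatim MSCCL++ device API.
  #include <cstdint>

  struct DeviceChannel {  // stand-in for an MSCCL++ channel device handle
    // Cooperative copy of `bytes` from the local registered buffer (srcOff)
    // into the peer's registered buffer (dstOff), split across nthreads threads.
    __device__ void put(uint64_t dstOff, uint64_t srcOff, uint64_t bytes,
                        uint32_t tid, uint32_t nthreads);
    __device__ void signal();  // notify the peer that the data is visible
    __device__ void wait();    // block until the peer's matching signal arrives
  };

  __global__ void pushAndSync(DeviceChannel* chan, uint64_t nbytes) {
    const uint32_t tid = threadIdx.x;
    const uint32_t nthreads = blockDim.x;

    // 0-copy put issued from inside the kernel: no host round-trip and no
    // separate communication launch, so the transfer can overlap compute.
    chan->put(/*dstOff=*/0, /*srcOff=*/0, nbytes, tid, nthreads);
    __syncthreads();

    if (tid == 0) {
      chan->signal();  // tell the peer our data has landed
      chan->wait();    // wait for the peer before reading its data
    }
    __syncthreads();
  }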

Quick Start & Requirements

  • Installation: The README does not provide specific installation commands but implies building from source.
  • Prerequisites: CUDA, ROCm (for integration tests), C++, Python.
  • Resources: Benchmarks suggest usage on Azure NDmv4 SKUs with A100-80G GPUs.
  • Links: MSCCL++ Overview, Quick Start (the Quick Start link is not functional in the provided text).

Highlighted Details

  • Demonstrates significant speedups over NCCL for AllReduce operations, crucial for LLM serving with tensor parallelism.
  • Offers two channel types: PortChannel (port-mapped; a single GPU thread posts transfers that a host-side proxy carries out) and MemoryChannel (memory-mapped; GPU threads access peer memory directly, targeting low latency). A sketch contrasting the two follows this list.
  • Supports custom host-side proxies for advanced optimization and tailored trigger handling.
  • Provides Python bindings for easier integration into Python-based AI frameworks.
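
To make the distinction between the two channel types concrete, here is a hedged sketch of how a kernel might drive each one. The type names (MemChannelHandle, PortChannelHandle) and signatures are illustrative assumptions rather than the actual MSCCL++ API: with the memory-mapped channel all threads write peer memory directly, while with the port-mapped channel a single thread posts a transfer that a host-side proxy carries out.

  // Conceptual contrast of the two channel flavors; types and signatures are
  // illustrative assumptions, not the verbatim MSCCL++ API.
  #include <cstdint>

  struct MemChannelHandle {   // memory-mapped: peer memory is directly addressable
    __device__ void put(uint64_t dstOff, uint64_t srcOff, uint64_t bytes,
                        uint32_t tid, uint32_t nthreads);  // all threads help copy
    __device__ void signal();
  };

  struct PortChannelHandle {  // port-mapped: a host-side proxy services triggers
    __device__ void put(uint64_t dstOff, uint64_t srcOff, uint64_t bytes);  // one thread posts a trigger
    __device__ void flush();  // wait until the proxy has issued the transfer
  };

  // Low-latency path: every thread in the block writes peer memory directly.
  __global__ void memoryChannelPath(MemChannelHandle* mc, uint64_t nbytes) {
    mc->put(0, 0, nbytes, threadIdx.x, blockDim.x);
    __syncthreads();
    if (threadIdx.x == 0) mc->signal();
  }

  // Proxy path: a single GPU thread hands the transfer off to the host-side
  // proxy, leaving the rest of the GPU free for computation.
  __global__ void portChannelPath(PortChannelHandle* pc, uint64_t nbytes) {
    if (threadIdx.x == 0) {
      pc->put(0, 0, nbytes);
      pc->flush();
    }
  }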

Maintenance & Community

  • Developed by Microsoft.
  • Welcomes contributions via a Contributor License Agreement (CLA).
  • Adheres to the Microsoft Open Source Code of Conduct.

Licensing & Compatibility

  • The README does not explicitly state the license type.

Limitations & Caveats

  • The README does not detail specific limitations, unsupported platforms, or known bugs. The "Quick Start" link appears to be non-functional in the provided text.

Health Check

  • Last commit: 19 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 40
  • Issues (30d): 1

Star History

  • 52 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 7 more.

ThunderKittens by HazyResearch

CUDA kernel framework for fast deep learning primitives

  • 3k stars
  • 0.6%
  • created 1 year ago
  • updated 3 days ago