mscclpp by microsoft

GPU-driven communication stack for scalable AI applications

Created 2 years ago
451 stars

Top 66.7% on SourcePulse

View on GitHub
Project Summary

MSCCL++ is a GPU-driven communication stack designed to enhance the efficiency and customizability of distributed AI applications. It offers a flexible, multi-layer abstraction for inter-GPU communication, targeting researchers and engineers working with large-scale AI models, particularly for LLM inference. The primary benefit is improved performance and reduced complexity in managing GPU-to-GPU data movement.

How It Works

MSCCL++ provides ultra-lightweight, on-GPU communication interfaces called "Channels" that can be called directly from CUDA kernels. A channel abstracts peer-to-peer communication and exposes data-movement and synchronization primitives such as put(), get(), signal(), flush(), and wait(). Both zero-copy synchronous and asynchronous operations are supported, enabling communication-computation overlap and custom collective algorithms without deadlocks. MSCCL++ unifies these abstractions across hardware interconnects (NVLink, InfiniBand) and GPU locations (local or remote nodes).
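
To make the channel idea concrete, here is a minimal, self-contained CUDA sketch of a kernel that does local compute, pushes the result into a peer-visible buffer, and then signals completion, mirroring the put()/signal()/wait() primitives described above. The PeerChannel struct, its method signatures, and fusedComputeAndSend are illustrative stand-ins written for this summary, not the actual MSCCL++ API.

```cuda
// Illustrative sketch only: PeerChannel models the put()/signal()/wait() idea
// described above with hand-rolled primitives; it is NOT the MSCCL++ API.
#include <cuda_runtime.h>
#include <cstdio>

struct PeerChannel {
  float*        dst;   // peer-visible destination buffer (NVLink/IB-mapped in a real setup)
  unsigned int* flag;  // peer-visible counter used for signal()/wait()

  // Cooperative copy of n floats into the peer buffer (models put()).
  __device__ void put(const float* src, size_t n) {
    for (size_t i = threadIdx.x; i < n; i += blockDim.x) dst[i] = src[i];
  }
  // Publish completion: one thread bumps the flag after all writes are ordered (models signal()).
  __device__ void signal() {
    __syncthreads();
    if (threadIdx.x == 0) { __threadfence_system(); atomicAdd(flag, 1u); }
  }
  // Spin until the peer has signaled at least `expected` times (models wait()).
  __device__ void wait(unsigned int expected) {
    if (threadIdx.x == 0) {
      while (atomicAdd(flag, 0u) < expected) { /* spin */ }
    }
    __syncthreads();
  }
};

// Communication is issued from inside the kernel, so data movement can be
// fused with (and overlapped against) local computation.
__global__ void fusedComputeAndSend(PeerChannel ch, const float* local, float* out, size_t n) {
  for (size_t i = threadIdx.x; i < n; i += blockDim.x) out[i] = 2.0f * local[i];
  __syncthreads();
  ch.put(out, n);
  ch.signal();
}

int main() {
  const size_t n = 1 << 20;
  float *local, *out, *peerBuf;
  unsigned int* flag;
  cudaMallocManaged(&local, n * sizeof(float));
  cudaMallocManaged(&out, n * sizeof(float));
  cudaMallocManaged(&peerBuf, n * sizeof(float));  // stands in for a remote GPU's mapped buffer
  cudaMallocManaged(&flag, sizeof(unsigned int));
  *flag = 0;
  for (size_t i = 0; i < n; ++i) local[i] = 1.0f;

  PeerChannel ch{peerBuf, flag};
  fusedComputeAndSend<<<1, 256>>>(ch, local, out, n);
  cudaDeviceSynchronize();
  printf("peerBuf[0] = %.1f, signals = %u\n", peerBuf[0], *flag);

  cudaFree(local); cudaFree(out); cudaFree(peerBuf); cudaFree(flag);
  return 0;
}
```

In a real MSCCL++ program the destination buffer and flag would live on a peer GPU (or remote node) and be reached over NVLink or InfiniBand; the single-GPU managed-memory stand-in here only illustrates the control flow.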

Quick Start & Requirements

  • Installation: The README does not provide specific installation commands but implies building from source.
  • Prerequisites: CUDA, ROCm (for integration tests), C++, Python.
  • Resources: Benchmark results are reported on Azure NDm A100 v4 SKUs with A100 80GB GPUs.
  • Links: MSCCL++ Overview and Quick Start (the Quick Start link is not functional in the provided text).

Highlighted Details

  • Demonstrates significant speedups over NCCL for AllReduce operations, crucial for LLM serving with tensor parallelism.
  • Offers two channel types: PortChannel (port-mapped; a single GPU thread triggers transfers that a host-side proxy carries out) and MemoryChannel (memory-mapped; GPU threads access peer memory directly, optimized for low latency). See the sketch after this list.
  • Supports custom host-side proxies for advanced optimization and tailored trigger handling.
  • Provides Python bindings for easier integration into Python-based AI frameworks.
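
The two channel flavors mainly differ in who performs the transfer. As a rough illustration of the PortChannel-style path, the sketch below has a single GPU thread post a transfer request into a host-visible FIFO, which a host-side "proxy" thread then services; the MemoryChannel-style path corresponds to the earlier sketch, where GPU threads write peer memory directly. Trigger, TriggerFifo, postTransfer, and proxyLoop are hypothetical names invented for this sketch, not MSCCL++ identifiers.

```cuda
// Illustrative sketch of proxy-based triggering (PortChannel-style); the names
// and layout here are invented for this summary and are NOT the MSCCL++ API.
#include <cuda_runtime.h>
#include <atomic>
#include <cstdio>
#include <thread>

struct Trigger {                 // one transfer request, written by the GPU, read by the host proxy
  unsigned long long srcOff, dstOff, bytes;
};

struct TriggerFifo {             // single-slot "FIFO" in host-pinned, GPU-visible memory
  Trigger      slot;
  volatile int ready;            // 0 = empty, 1 = request pending
};

// Device side: exactly one thread posts the request (the "single GPU thread"
// behavior noted above); the rest of the kernel is free to keep computing.
__global__ void postTransfer(TriggerFifo* fifo, unsigned long long bytes) {
  if (threadIdx.x == 0 && blockIdx.x == 0) {
    fifo->slot.srcOff = 0;
    fifo->slot.dstOff = 0;
    fifo->slot.bytes  = bytes;
    __threadfence_system();      // make the request visible to the host...
    fifo->ready = 1;             // ...before handing it to the proxy
  }
}

// Host side: the proxy polls the FIFO and executes the actual data movement
// (a plain device-to-device copy here; NVLink/InfiniBand in a real stack).
void proxyLoop(TriggerFifo* fifo, float* src, float* dst, std::atomic<bool>* stop) {
  while (!stop->load()) {
    if (fifo->ready == 1) {
      cudaMemcpy(dst, src, fifo->slot.bytes, cudaMemcpyDeviceToDevice);
      fifo->ready = 0;           // mark the request as serviced
    }
  }
}

int main() {
  const size_t n = 1 << 20;
  float *src, *dst;
  cudaMalloc(&src, n * sizeof(float));
  cudaMalloc(&dst, n * sizeof(float));

  TriggerFifo* fifo;
  cudaHostAlloc(&fifo, sizeof(TriggerFifo), cudaHostAllocMapped);
  fifo->ready = 0;
  TriggerFifo* dFifo;
  cudaHostGetDevicePointer((void**)&dFifo, fifo, 0);  // device-visible alias of the pinned FIFO

  std::atomic<bool> stop{false};
  std::thread proxy(proxyLoop, fifo, src, dst, &stop);

  postTransfer<<<1, 32>>>(dFifo, n * sizeof(float));
  cudaDeviceSynchronize();
  while (fifo->ready != 0) { }   // wait until the proxy has serviced the request
  stop = true;
  proxy.join();

  printf("transfer serviced by host proxy\n");
  cudaFree(src); cudaFree(dst); cudaFreeHost(fifo);
  return 0;
}
```

The trade-off mirrors the bullet above: routing through a proxy costs a hop through the host but keeps the GPU-side footprint to a single triggering thread, while the memory-mapped path gives the lowest latency by letting GPU threads move data themselves.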

Maintenance & Community

  • Developed by Microsoft.
  • Welcomes contributions via a Contributor License Agreement (CLA).
  • Adheres to the Microsoft Open Source Code of Conduct.

Licensing & Compatibility

  • The README does not explicitly state the license type.

Limitations & Caveats

  • The README does not detail specific limitations, unsupported platforms, or known bugs. The "Quick Start" link appears to be non-functional in the provided text.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 17
  • Issues (30d): 4

Star History

  • 7 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Johannes Hagemann (Cofounder of Prime Intellect), and 4 more.

S-LoRA by S-LoRA

Top 0.1% on SourcePulse
2k stars
System for scalable LoRA adapter serving
Created 2 years ago
Updated 2 years ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 4 more.

gpu.cpp by AnswerDotAI

Top 0.1% on SourcePulse
4k stars
C++ library for portable GPU computation using WebGPU
Created 1 year ago
Updated 3 months ago