mscclpp by microsoft

GPU-driven communication stack for scalable AI applications

Created 2 years ago
417 stars

Top 70.4% on SourcePulse

Project Summary

MSCCL++ is a GPU-driven communication stack designed to enhance the efficiency and customizability of distributed AI applications. It offers a flexible, multi-layer abstraction for inter-GPU communication, targeting researchers and engineers working with large-scale AI models, particularly for LLM inference. The primary benefit is improved performance and reduced complexity in managing GPU-to-GPU data movement.

How It Works

MSCCL++ provides ultra-lightweight, on-GPU communication interfaces called "Channels" that can be called directly from CUDA kernels. Channels abstract peer-to-peer communication and expose data-movement and synchronization primitives such as put(), get(), signal(), flush(), and wait(). Both zero-copy synchronous and asynchronous operations are supported, which allows communication to overlap with computation and lets users write custom collective algorithms without deadlocks. The same abstractions apply across hardware interconnects (NVLink, InfiniBand) and GPU locations (local or remote nodes).
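
As a rough illustration of how put(), signal(), and wait() fit together, the following Python toy model simulates one direction of a channel, with two host threads standing in for two GPUs. It is not the MSCCL++ API: the ToyChannel class, its method signatures, and the exchange() helper are hypothetical names chosen for this sketch, and real channels are called from CUDA kernels rather than from host threads.

```python
import threading


class ToyChannel:
    """Toy model of one direction of a peer-to-peer link (hypothetical, not MSCCL++)."""

    def __init__(self, remote_buffer):
        self._remote = remote_buffer          # the peer's receive buffer
        self._sem = threading.Semaphore(0)    # counts signals not yet consumed

    def put(self, dst_offset, src, src_offset, count):
        # Copy `count` elements from the local buffer into the peer's buffer.
        self._remote[dst_offset:dst_offset + count] = src[src_offset:src_offset + count]

    def flush(self):
        # In this toy model put() completes immediately, so there is nothing to drain.
        pass

    def signal(self):
        # Tell the peer that the preceding put() is visible.
        self._sem.release()

    def wait(self):
        # Block until the peer has signalled.
        self._sem.acquire()


def exchange():
    send_buf = [1, 2, 3, 4]   # data owned by "rank 0"
    recv_buf = [0, 0, 0, 0]   # receive buffer owned by "rank 1"
    channel = ToyChannel(remote_buffer=recv_buf)
    seen = []

    def rank0():
        channel.put(0, send_buf, 0, len(send_buf))
        channel.flush()
        channel.signal()

    def rank1():
        channel.wait()            # do not read until the sender has signalled
        seen.extend(recv_buf)

    threads = [threading.Thread(target=rank1), threading.Thread(target=rank0)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


if __name__ == "__main__":
    print(exchange())  # [1, 2, 3, 4]
```

The receiver reads its buffer only after wait() returns, which is the ordering guarantee that the signal/wait pair is meant to convey. Any work the receiver does before calling wait() would overlap with the transfer.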

Quick Start & Requirements

  • Installation: The README does not provide specific installation commands but implies building from source.
  • Prerequisites: CUDA, ROCm (for integration tests), C++, Python.
  • Resources: Benchmarks suggest usage on Azure NDmv4 SKUs with A100-80G GPUs.
  • Links: MSCCL++ Overview, Quick Start (link not functional in provided text).

Highlighted Details

  • Demonstrates significant speedups over NCCL for AllReduce operations, crucial for LLM serving with tensor parallelism.
  • Offers two channel types: PortChannel (port-mapping, single GPU thread, proxy-based) and MemoryChannel (memory-mapping, direct GPU thread access, low-latency focused).
  • Supports custom host-side proxies for advanced optimization and tailored trigger handling.
  • Provides Python bindings for easier integration into Python-based AI frameworks.
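
For readers unfamiliar with AllReduce, the sketch below simulates a ring AllReduce (reduce-scatter followed by all-gather) in plain Python, with lists standing in for per-GPU buffers. It is a textbook algorithm shown only to illustrate what the collective computes; it is an assumption made for illustration and not necessarily the algorithm MSCCL++ uses. The function name ring_allreduce is hypothetical.

```python
def ring_allreduce(buffers):
    """Simulate a sum-AllReduce over equally sized per-rank lists; returns new lists."""
    n = len(buffers)
    length = len(buffers[0])
    assert all(len(b) == length for b in buffers), "all ranks need equal-sized buffers"
    assert length % n == 0, "this sketch assumes the buffer splits evenly into n chunks"
    chunk = length // n
    data = [list(b) for b in buffers]  # work on copies, leave the inputs untouched

    def seg(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Reduce-scatter: after n - 1 steps, rank r holds the full sum of chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot every payload first so all ranks "send" simultaneously.
        sends = [(r, (r - step) % n) for r in range(n)]
        payloads = [data[r][seg(c)] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % n
            data[dst][seg(c)] = [a + b for a, b in zip(data[dst][seg(c)], payload)]

    # All-gather: circulate the fully reduced chunks around the ring.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n) for r in range(n)]
        payloads = [data[r][seg(c)] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            data[(r + 1) % n][seg(c)] = payload

    return data


if __name__ == "__main__":
    # Two ranks, each holding a partial result, as in tensor parallelism.
    print(ring_allreduce([[1, 2], [10, 20]]))  # [[11, 22], [11, 22]]
```

In tensor-parallel inference each GPU produces a partial result for the same activation, so every layer needs a step like this before the next one can run. That is why AllReduce latency matters so much for LLM serving.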

Maintenance & Community

  • Developed by Microsoft.
  • Welcomes contributions via a Contributor License Agreement (CLA).
  • Adheres to the Microsoft Open Source Code of Conduct.

Licensing & Compatibility

  • The README does not explicitly state the license type.

Limitations & Caveats

  • The README does not detail specific limitations, unsupported platforms, or known bugs. The "Quick Start" link appears to be non-functional in the provided text.

Health Check

  • Last Commit: 17 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 15
  • Issues (30d): 2

Star History

  • 20 stars in the last 30 days
