rccl by ROCm

ROCm library for GPU collective communication routines

created 7 years ago
353 stars

Top 80.0% on sourcepulse

Project Summary

The ROCm Communication Collectives Library (RCCL) provides optimized collective communication routines for GPUs, targeting researchers and developers building large-scale AI and HPC applications. It enables efficient inter-GPU communication across multiple nodes, aiming to maximize bandwidth and minimize latency.

How It Works

RCCL implements standard collective operations like all-reduce, broadcast, and all-gather using ring and tree algorithms. It is optimized for various interconnects (PCIe, xGMI, InfiniBand, TCP/IP) and supports arbitrary numbers of GPUs in single or multi-node, multi-process applications. For performance, small operations can be batched or aggregated via the API.
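
The ring approach mentioned above can be sketched in plain Python, with lists standing in for GPU buffers. This is an illustrative simulation of the textbook ring all-reduce (a reduce-scatter phase followed by an all-gather phase), not RCCL's implementation. The function name ring_all_reduce and the restriction that the buffer length be divisible by the rank count are assumptions made to keep the sketch short.

```python
def ring_all_reduce(buffers):
    """Simulate a sum all-reduce across len(buffers) ranks arranged in a ring.

    Hypothetical helper for illustration only; it does not call RCCL.
    Returns one result list per rank, each holding the element-wise sum.
    """
    n = len(buffers)
    size = len(buffers[0])
    if any(len(b) != size for b in buffers):
        raise ValueError("all ranks must hold buffers of equal length")
    if size % n != 0:
        raise ValueError("this sketch needs the length divisible by the rank count")

    chunk = size // n
    data = [list(b) for b in buffers]  # copy so the inputs stay untouched

    def span(c):
        # Index range covered by chunk c.
        return range(c * chunk, (c + 1) * chunk)

    def exchange(step, offset, combine):
        # Every rank sends one chunk to its right-hand neighbour.
        # Payloads are snapshotted first to model a simultaneous exchange.
        sends = []
        for rank in range(n):
            c = (rank + offset - step) % n
            sends.append((c, [data[rank][i] for i in span(c)]))
        for rank, (c, payload) in enumerate(sends):
            dst = (rank + 1) % n
            for i, value in zip(span(c), payload):
                data[dst][i] = combine(data[dst][i], value)

    # Phase 1 (reduce-scatter): after n - 1 steps, rank r holds the fully
    # reduced chunk (r + 1) % n.
    for step in range(n - 1):
        exchange(step, 0, lambda mine, incoming: mine + incoming)

    # Phase 2 (all-gather): completed chunks circulate around the ring and
    # overwrite the partial sums held by each receiver.
    for step in range(n - 1):
        exchange(step, 1, lambda mine, incoming: incoming)

    return data
```

Each rank sends and receives only one chunk per step, which is why the ring pattern keeps every link busy for large buffers.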

Quick Start & Requirements

  • Install: Use the provided install.sh script (./install.sh) or build manually with CMake.
  • Prerequisites: ROCm stack (HIP runtime and HIP-Clang) and a ROCm-supported GPU.
  • Build: install.sh offers options for quick builds, debugging, and targeting specific GPU architectures. A manual build runs cmake .. && make -j <jobs> from a build directory.
  • Documentation: Available at RCCL Documentation Site.

Highlighted Details

  • Supports direct GPU-to-GPU send/receive operations.
  • Optimized for high bandwidth on PCIe, xGMI, InfiniBand Verbs, and TCP/IP.
  • Collectives are implemented with ring and tree algorithms, tuned for throughput and latency.
  • Offers batching and aggregation for small operations.
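
Why a library would carry both ring and tree algorithms can be illustrated with a simplified latency-bandwidth cost model. The sketch below is a textbook approximation, not RCCL's selection logic: the function names, the non-pipelined binary-tree assumption, and the example latency (alpha) and per-byte cost (beta) are all assumptions chosen for illustration.

```python
import math


def ring_time(n, size_bytes, alpha, beta):
    """Estimated ring all-reduce time: 2(n - 1) steps, each moving size/n bytes."""
    if n == 1:
        return 0.0
    steps = 2 * (n - 1)
    return steps * (alpha + (size_bytes / n) * beta)


def tree_time(n, size_bytes, alpha, beta):
    """Estimated non-pipelined binary-tree all-reduce time: reduce up the
    tree, then broadcast down, moving the whole buffer at every level."""
    if n == 1:
        return 0.0
    depth = math.ceil(math.log2(n))
    return 2 * depth * (alpha + size_bytes * beta)


def pick_algorithm(n, size_bytes, alpha, beta):
    """Return whichever algorithm the model predicts to be faster."""
    ring = ring_time(n, size_bytes, alpha, beta)
    tree = tree_time(n, size_bytes, alpha, beta)
    return "ring" if ring <= tree else "tree"


if __name__ == "__main__":
    # Assumed link parameters: 5 microseconds per message, 25 GB/s bandwidth.
    alpha, beta = 5e-6, 1 / 25e9
    for size in (1024, 2**20, 2**30):
        print(size, pick_algorithm(8, size, alpha, beta))
```

Under these assumed parameters and 8 ranks, the model favours the tree for a 1 KiB buffer, where the number of steps dominates, and the ring for a 1 GiB buffer, where bytes moved per rank dominate.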

Maintenance & Community

  • Developed by Advanced Micro Devices, Inc.
  • Documentation is open source and can be built locally using Sphinx.

Licensing & Compatibility

  • Copyright (c) 2015-2022, NVIDIA CORPORATION. All rights reserved.
  • Modifications copyright (c) 2019-2022 Advanced Micro Devices, Inc. All rights reserved.
  • The README does not state license terms. The NVIDIA copyright reflects that RCCL is derived from NVIDIA's NCCL; check the license file in the repository before commercial use.

Limitations & Caveats

  • Requires a full ROCm stack installation.
  • Build process and options can be complex; the install.sh script simplifies initial setup.
  • The README gives no benchmark figures to support its performance claims.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 week
  • Pull requests (30d): 78
  • Issues (30d): 3
  • Star history: 30 stars in the last 90 days