rccl by ROCm

ROCm library for GPU collective communication routines

Created 7 years ago
365 stars

Top 77.1% on SourcePulse

Project Summary

The ROCm Communication Collectives Library (RCCL) provides optimized collective communication routines for GPUs, targeting researchers and developers building large-scale AI and HPC applications. It enables efficient inter-GPU communication across multiple nodes, aiming to maximize bandwidth and minimize latency.

How It Works

RCCL implements standard collective operations like all-reduce, broadcast, and all-gather using ring and tree algorithms. It is optimized for various interconnects (PCIe, xGMI, InfiniBand, TCP/IP) and supports arbitrary numbers of GPUs in single or multi-node, multi-process applications. For performance, small operations can be batched or aggregated via the API.
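Because RCCL keeps an NCCL-compatible C API, a single-process, multi-GPU all-reduce can be sketched as below. This is a hedged illustration, not an official sample: the buffer size is arbitrary, error checking is elided, and compiling it requires the ROCm stack and supported GPUs. The header path may be `<rccl.h>` or `<rccl/rccl.h>` depending on the ROCm version.

```cpp
#include <hip/hip_runtime.h>
#include <rccl/rccl.h>   // or <rccl.h> on older ROCm releases
#include <vector>

// Sketch: sum-all-reduce a float buffer across every visible GPU from one
// process, using RCCL's NCCL-compatible API. Error checking elided.
int main() {
    int ndev = 0;
    hipGetDeviceCount(&ndev);

    std::vector<ncclComm_t> comms(ndev);
    ncclCommInitAll(comms.data(), ndev, nullptr);   // one communicator per GPU

    const size_t count = 1 << 20;                   // illustrative buffer size
    std::vector<float*> send(ndev), recv(ndev);
    std::vector<hipStream_t> streams(ndev);
    for (int i = 0; i < ndev; ++i) {
        hipSetDevice(i);
        hipMalloc(&send[i], count * sizeof(float));
        hipMalloc(&recv[i], count * sizeof(float));
        hipStreamCreate(&streams[i]);
    }

    // Group the per-GPU calls so RCCL can aggregate their launches --
    // the same mechanism the API offers for batching small operations.
    ncclGroupStart();
    for (int i = 0; i < ndev; ++i)
        ncclAllReduce(send[i], recv[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < ndev; ++i) {
        hipSetDevice(i);
        hipStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```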

Quick Start & Requirements

  • Install: Use the provided install.sh script (./install.sh) or build manually with CMake.
  • Prerequisites: ROCm stack (HIP runtime & HIP-Clang), ROCm supported GPUs.
  • Build: install.sh offers options for quick builds, debugging, and targeting specific GPU architectures. For a manual build, create a build directory, then run cmake .. && make -j <jobs> from inside it.
  • Documentation: Available at RCCL Documentation Site.
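Putting the steps above together, a from-source build might look like the following. The flag values are illustrative, and a working ROCm installation is assumed:

```shell
# From the rccl checkout: scripted build via the bundled helper
./install.sh            # build options are documented by the script and README

# -- or a manual CMake build --
mkdir build && cd build
cmake ..                # install prefix, GPU targets, etc. may be set here
make -j 16              # <jobs> = 16 is illustrative
```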

Highlighted Details

  • Supports direct GPU-to-GPU send/receive operations.
  • Optimized for high bandwidth on PCIe, xGMI, InfiniBand Verbs, and TCP/IP.
  • Implemented using ring and tree algorithms for throughput and latency optimization.
  • Offers batching and aggregation for small operations.

Maintenance & Community

  • Developed by Advanced Micro Devices, Inc.
  • Documentation is open source and can be built locally using Sphinx.

Licensing & Compatibility

  • Copyright (c) 2015-2022, NVIDIA CORPORATION. All rights reserved.
  • Modifications copyright (c) 2019-2022 Advanced Micro Devices, Inc. All rights reserved.
  • The README does not state a license explicitly; the dual NVIDIA/AMD copyright reflects RCCL's origin as a fork of NVIDIA's NCCL. Check the repository's LICENSE file before assuming commercial-use terms.

Limitations & Caveats

  • Requires a full ROCm stack installation.
  • Build process and options can be complex; the install.sh script simplifies initial setup.
  • Specific performance claims are not benchmarked within the README.
Health Check

  • Last commit: 15 hours ago
  • Responsiveness: 1 week
  • Pull requests (30d): 68
  • Issues (30d): 2
  • Star history: 10 stars in the last 30 days
