rccl by ROCm

ROCm library for GPU collective communication routines

Created 7 years ago
365 stars

Top 77.1% on SourcePulse

Project Summary

The ROCm Communication Collectives Library (RCCL) provides optimized collective communication routines for GPUs, targeting researchers and developers building large-scale AI and HPC applications. It enables efficient inter-GPU communication across multiple nodes, aiming to maximize bandwidth and minimize latency.

How It Works

RCCL implements standard collective operations like all-reduce, broadcast, and all-gather using ring and tree algorithms. It is optimized for various interconnects (PCIe, xGMI, InfiniBand, TCP/IP) and supports arbitrary numbers of GPUs in single or multi-node, multi-process applications. For performance, small operations can be batched or aggregated via the API.
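Because RCCL keeps an NCCL-compatible C API, a single-process, multi-GPU all-reduce can be sketched as below. This is a hedged illustration, not an official sample: the buffer size is arbitrary, error checking is elided, and compiling it requires the ROCm stack and supported GPUs. The header path may be `<rccl.h>` or `<rccl/rccl.h>` depending on the ROCm version.

```cpp
#include <hip/hip_runtime.h>
#include <rccl/rccl.h>   // or <rccl.h> on older ROCm releases
#include <vector>

// Sketch: sum-all-reduce a float buffer across every visible GPU from one
// process, using RCCL's NCCL-compatible API. Error checking elided.
int main() {
    int ndev = 0;
    hipGetDeviceCount(&ndev);

    std::vector<ncclComm_t> comms(ndev);
    ncclCommInitAll(comms.data(), ndev, nullptr);   // one communicator per GPU

    const size_t count = 1 << 20;                   // illustrative buffer size
    std::vector<float*> send(ndev), recv(ndev);
    std::vector<hipStream_t> streams(ndev);
    for (int i = 0; i < ndev; ++i) {
        hipSetDevice(i);
        hipMalloc(&send[i], count * sizeof(float));
        hipMalloc(&recv[i], count * sizeof(float));
        hipStreamCreate(&streams[i]);
    }

    // Group the per-GPU calls so RCCL can aggregate their launches --
    // the same mechanism the API offers for batching small operations.
    ncclGroupStart();
    for (int i = 0; i < ndev; ++i)
        ncclAllReduce(send[i], recv[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < ndev; ++i) {
        hipSetDevice(i);
        hipStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```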

Quick Start & Requirements

  • Install: Use the provided install.sh script (./install.sh) or build manually with CMake.
  • Prerequisites: ROCm stack (HIP runtime & HIP-Clang), ROCm supported GPUs.
  • Build: install.sh offers options for quick builds, debugging, and targeting specific GPU architectures. For a manual build, create a build directory, then run cmake .. && make -j <jobs> from inside it.
  • Documentation: Available at RCCL Documentation Site.
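Putting the steps above together, a from-source build might look like the following. The flag values are illustrative, and a working ROCm installation is assumed:

```shell
# From the rccl checkout: scripted build via the bundled helper
./install.sh            # build options are documented by the script and README

# -- or a manual CMake build --
mkdir build && cd build
cmake ..                # install prefix, GPU targets, etc. may be set here
make -j 16              # <jobs> = 16 is illustrative
```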

Highlighted Details

  • Supports direct GPU-to-GPU send/receive operations.
  • Optimized for high bandwidth on PCIe, xGMI, InfiniBand Verbs, and TCP/IP.
  • Implemented using ring and tree algorithms for throughput and latency optimization.
  • Offers batching and aggregation for small operations.

Maintenance & Community

  • Developed by Advanced Micro Devices, Inc.
  • Documentation is open source and can be built locally using Sphinx.

Licensing & Compatibility

  • Copyright (c) 2015-2022, NVIDIA CORPORATION. All rights reserved.
  • Modifications copyright (c) 2019-2022 Advanced Micro Devices, Inc. All rights reserved.
  • The README does not state a license explicitly; the dual NVIDIA/AMD copyright reflects RCCL's origin as a fork of NVIDIA's NCCL. Check the repository's LICENSE file before assuming commercial-use terms.

Limitations & Caveats

  • Requires a full ROCm stack installation.
  • Build process and options can be complex; the install.sh script simplifies initial setup.
  • Specific performance claims are not benchmarked within the README.
Health Check

  • Last commit: 15 hours ago
  • Responsiveness: 1 week
  • Pull requests (30d): 68
  • Issues (30d): 2
  • Star history: 10 stars in the last 30 days
