uccl by uccl-project

GPU collective communication library for ML workloads

Created 10 months ago

1,100 stars

Top 34.5% on SourcePulse

2 Experts Love This Project

Kaichao You

Core Maintainer of vLLM

Robert Nishihara

Cofounder of Anyscale; Author of Ray

Project Summary

UCCL is an open-source collective communication library designed to improve GPU communication performance for machine learning workloads, offering a drop-in replacement for NCCL/RCCL. It targets researchers and practitioners seeking lower latency and higher throughput, particularly in heterogeneous GPU and networking environments.

How It Works

UCCL re-architects the communication layer around a custom software transport that sprays packets across many network paths to avoid congestion. Combined with advanced congestion control and efficient loss recovery, this approach aims to outperform traditional single-path transports such as kernel TCP and RDMA.
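
The load-balancing idea behind packet spraying can be illustrated with a toy model. This is a minimal sketch, not UCCL's implementation: the function name path_loads, the uniform random path choice, and the packet and path counts are all assumptions made for illustration, and the model ignores congestion control, reordering, and loss recovery.

```python
import random

def path_loads(num_packets, num_paths, spray, seed=0):
    """Return per-path packet counts for a single flow (toy model)."""
    rng = random.Random(seed)
    loads = [0] * num_paths
    # A single-path transport keeps the whole flow on one path.
    pinned = rng.randrange(num_paths)
    for _ in range(num_packets):
        # Spraying picks a path per packet (uniform choice is an assumption).
        path = rng.randrange(num_paths) if spray else pinned
        loads[path] += 1
    return loads

single = path_loads(10_000, 8, spray=False)
sprayed = path_loads(10_000, 8, spray=True)
print("single-path busiest link:", max(single))
print("sprayed busiest link:    ", max(sprayed))
```

In this model the single-path flow places every packet on one link, while the sprayed flow spreads them roughly evenly, so the busiest link carries only a fraction of the traffic.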

Quick Start & Requirements

Install by cloning the repository and running bash build_and_install.sh [cuda|rocm].
Requires CUDA or ROCm.
Usage involves setting NCCL_NET_PLUGIN and LD_PRELOAD environment variables to point to UCCL plugins for specific network configurations (IB/RoCE, AWS EFA).
Official website: https://uccl-project.github.io/
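
As a rough illustration of the environment setup described above, the sketch below builds an environment with NCCL_NET_PLUGIN and LD_PRELOAD set before launching a job. The helper name uccl_env and both file paths are hypothetical placeholders; the actual library names and locations depend on the build and on the network configuration, so consult the project documentation for the real values.

```python
import os

def uccl_env(plugin_path, preload_path, base_env=None):
    """Return a copy of the environment with the two variables set.

    plugin_path and preload_path are placeholders supplied by the caller;
    this helper does not know where UCCL installs its libraries.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_NET_PLUGIN"] = plugin_path
    # Keep any existing LD_PRELOAD entries after the new one.
    existing = env.get("LD_PRELOAD")
    env["LD_PRELOAD"] = f"{preload_path}:{existing}" if existing else preload_path
    return env

# Hypothetical paths, for illustration only.
env = uccl_env("/path/to/uccl/net-plugin.so", "/path/to/uccl/preload.so", base_env={})
for key in ("NCCL_NET_PLUGIN", "LD_PRELOAD"):
    print(f"{key}={env[key]}")
```

The returned dictionary can be passed as the env argument of subprocess.run when starting the training or benchmark command.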

Highlighted Details

Up to 2.5x performance improvement over NCCL for AllReduce on HGX servers with H100 GPUs.
Up to 3.3x improvement for AlltoAll on AWS p4d instances with A100 GPUs.
Up to 3.7x improvement for AllReduce on AWS g4dn instances with T4 GPUs.
Supports heterogeneous GPU and networking vendors (Nvidia, AMD, Broadcom).
Aims to provide vendor-agnostic Triton kernels for collectives.

Maintenance & Community

Actively developed at UC Berkeley Sky Computing Lab and UC Davis ArtSy lab. Supported by AMD, AWS, Broadcom, CloudLab, Google Cloud, IBM, Lambda, and Mibura. Community engagement via GitHub issues.

Licensing & Compatibility

The repository does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is under active development, with features like dynamic membership and improved KV cache transfer still pending. The absence of a specified license may pose adoption challenges for commercial applications.

Health Check

Last Commit

2 days ago

Responsiveness

Inactive

Star History

385 stars in the last 30 days