gdrcopy by NVIDIA

GPU memory copy library using GPUDirect RDMA

created 10 years ago
1,169 stars

Top 33.9% on sourcepulse

Project Summary

GDRCopy is a low-latency library enabling direct CPU access to GPU memory via NVIDIA GPUDirect RDMA. It's designed for researchers and developers requiring high-performance data transfers between CPU and GPU, offering a CPU-driven copy mechanism with minimal overhead.

How It Works

GDRCopy uses the GPUDirect RDMA APIs to create user-space mappings of GPU memory. The CPU can then read and write that memory much as it would plain host memory (with caveats), which makes CPU-driven transfers possible without intermediate copies. The per-copy overhead is small, but the buffer must first be pinned, and this initial step can be expensive (the project README cites roughly 10 us to 1 ms depending on buffer size).

Quick Start & Requirements

  • Install: Build from source (make), RPM packages (build-rpm-packages.sh), or DEB packages (build-deb-packages.sh).
  • Prerequisites: NVIDIA Data Center/RTX GPU (Kepler+), CUDA >= 6.0, NVIDIA driver >= 418.40 (ppc64le) or >= 331.14 (other platforms), DKMS or equivalent for kernel module installation. GPU driver header files are also required.
  • Supported Platforms: Linux x86_64, ppc64le, arm64 on RHEL8/9, Ubuntu 20.04/22.04, SLE-15, Leap 15.x.
  • Links: GPUDirect RDMA

Highlighted Details

  • Achieves very low CPU-driven copy overhead (e.g., ~0.09 us for small transfers).
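  • Host-to-Device (H-D) bandwidth reaches 6-8 GB/s thanks to write-combining, subject to NUMA effects. Device-to-Host (D-H) bandwidth is significantly lower because the GPU BAR that backs the mappings cannot be prefetched, so burst read transactions are not generated over PCIe.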
  • Includes benchmarks for copy bandwidth (gdrcopy_copybw), latency (gdrcopy_copylat), API performance (gdrcopy_apiperf), and ping-pong latency (gdrcopy_pplat).
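  • Performance is sensitive to NUMA placement: results are best when the CPU driving the copy is the one closest to the GPU, so binding the process to that socket (e.g., with numactl) is a useful tuning step.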

Maintenance & Community

  • Developed and maintained by NVIDIA.
  • Bug reporting via NVIDIA Developer site.

Licensing & Compatibility

  • License: MIT, per the license file in the repository; the README itself does not state it explicitly.
  • Compatibility: Requires specific NVIDIA hardware and drivers. Works with regular CUDA device memory as returned by cudaMalloc; it does not work with CUDA managed memory.

Limitations & Caveats

  • gdr_map() requires addresses aligned to GPU pages; users must ensure alignment.
  • Memory regions that span multiple cudaMalloc allocations are not well supported.
  • With the proprietary NVIDIA driver flavor, performance may be suboptimal on coherent platforms, and problems may occur on Intel CPUs with confidential computing enabled.
  • Pinning the same GPU address multiple times may consume excessive BAR1 space on some driver versions.
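
The gdr_map() alignment requirement above is commonly handled by over-allocating the device buffer and rounding the returned address up to the next GPU page boundary. The following is a minimal sketch of that arithmetic only; it makes no GDRCopy or CUDA calls. It assumes a 64 KiB GPU page, which is an assumption of this sketch (real code should take the page size from the library header rather than hard-code it), and every name in it (SKETCH_GPU_PAGE_SIZE, align_up_to_gpu_page, is_gpu_page_aligned, padded_alloc_size) is hypothetical rather than part of the GDRCopy API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64 KiB GPU pages. Hypothetical constants, not library macros. */
#define SKETCH_GPU_PAGE_SHIFT 16
#define SKETCH_GPU_PAGE_SIZE  ((uintptr_t)1 << SKETCH_GPU_PAGE_SHIFT)
#define SKETCH_GPU_PAGE_MASK  (~(SKETCH_GPU_PAGE_SIZE - 1))

/* Round an address up to the next GPU page boundary (no-op if already aligned). */
static uintptr_t align_up_to_gpu_page(uintptr_t addr)
{
    return (addr + SKETCH_GPU_PAGE_SIZE - 1) & SKETCH_GPU_PAGE_MASK;
}

/* Return 1 if the address sits exactly on a GPU page boundary, else 0. */
static int is_gpu_page_aligned(uintptr_t addr)
{
    return (addr & (SKETCH_GPU_PAGE_SIZE - 1)) == 0;
}

/*
 * Number of bytes to request from the allocator so that a page-aligned
 * region of at least `wanted` bytes (rounded up to whole GPU pages) is
 * guaranteed to fit, wherever the allocator places the buffer.
 */
static size_t padded_alloc_size(size_t wanted)
{
    /* Guard against overflow in the rounding arithmetic below. */
    assert(wanted <= SIZE_MAX - 2 * (size_t)SKETCH_GPU_PAGE_SIZE);

    size_t rounded = (wanted + (size_t)SKETCH_GPU_PAGE_SIZE - 1)
                     & (size_t)SKETCH_GPU_PAGE_MASK;
    return rounded + (size_t)SKETCH_GPU_PAGE_SIZE - 1;
}
```

In use, a caller would request padded_alloc_size(n) bytes from cudaMalloc, pass the result of align_up_to_gpu_page() on the returned pointer to the pin and map calls, and keep the original pointer for freeing the buffer later.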

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull requests (30d): 1
  • Issues (30d): 2

Star History

  • 93 stars in the last 90 days
