GDRCopy is a low-latency library enabling direct CPU access to GPU memory via NVIDIA GPUDirect RDMA. It's designed for researchers and developers requiring high-performance data transfers between CPU and GPU, offering a CPU-driven copy mechanism with minimal overhead.
How It Works
GDRCopy leverages GPUDirect RDMA APIs to create user-space mappings of GPU memory. This allows GPU memory to be treated like host memory, facilitating efficient CPU-driven data transfers. The approach minimizes overhead by avoiding intermediate copies, though an initial memory pinning phase is required.
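A minimal sketch of the typical call sequence, assuming a CUDA context is already initialized and, for simplicity, that the cudaMalloc allocation happens to be GPU-page aligned (general offset handling is sketched under Limitations & Caveats); error handling is omitted for brevity:

```c
#include <stdio.h>
#include <stdint.h>
#include <cuda_runtime.h>
#include <gdrapi.h>                      /* gdr_open, gdr_pin_buffer, gdr_map, ... */

int main(void)
{
    const size_t size = GPU_PAGE_SIZE;   /* one 64 KiB GPU page */
    char *d_buf = NULL;
    cudaMalloc((void **)&d_buf, size);   /* plain device memory; managed memory is not supported */

    gdr_t g = gdr_open();                /* talks to the gdrdrv kernel module */
    gdr_mh_t mh;
    gdr_pin_buffer(g, (unsigned long)(uintptr_t)d_buf, size, 0, 0, &mh);

    void *map_ptr = NULL;
    gdr_map(g, mh, &map_ptr, size);      /* user-space mapping of the pinned GPU pages */

    char host_src[64] = "hello from the CPU";
    char host_dst[64] = {0};
    gdr_copy_to_mapping(mh, map_ptr, host_src, sizeof(host_src));    /* CPU-driven H->D copy */
    gdr_copy_from_mapping(mh, host_dst, map_ptr, sizeof(host_dst));  /* CPU-driven D->H copy (slow path) */
    printf("read back: %s\n", host_dst);

    gdr_unmap(g, mh, map_ptr, size);
    gdr_unpin_buffer(g, mh);
    gdr_close(g);
    cudaFree(d_buf);
    return 0;
}
```

Link such a program against libgdrapi and the CUDA runtime, with the gdrdrv kernel module loaded.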
Quick Start & Requirements
- Install: build from source (make), or build RPM (build-rpm-packages.sh) or DEB (build-deb-packages.sh) packages.
- Prerequisites: NVIDIA Data Center/RTX GPU (Kepler+), CUDA >= 6.0, NVIDIA driver >= 418.40 (ppc64le) or >= 331.14 (other platforms), DKMS or equivalent for kernel module installation. GPU driver header files are also required.
- Supported Platforms: Linux x86_64, ppc64le, arm64 on RHEL8/9, Ubuntu 20.04/22.04, SLE-15, Leap 15.x.
- Links: GPUDirect RDMA
Highlighted Details
- Achieves very low CPU-driven copy overhead (e.g., ~0.09 us for small transfers).
- Host-to-Device (H-D) bandwidth reaches roughly 6-8 GB/s; Device-to-Host (D-H) bandwidth is significantly lower because CPU reads from PCIe BAR-mapped GPU memory are slow.
- Includes benchmarks for copy bandwidth (gdrcopy_copybw), latency (gdrcopy_copylat), API performance (gdrcopy_apiperf), and ping-pong latency (gdrcopy_pplat); a rough user-level timing sketch follows this list.
- Supports NUMA-aware optimizations for performance tuning.
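The latency and bandwidth figures above come from the bundled benchmarks. As a rough, hedged stand-in for what gdrcopy_copylat measures (not its actual implementation), a small helper like the hypothetical avg_copy_latency_us below times repeated CPU-driven copies into an existing mapping; mh and map_ptr are assumed to come from a setup like the earlier sketch:

```c
#include <time.h>
#include <gdrapi.h>

/* Average per-copy latency, in microseconds, of `iters` CPU-driven
 * H->D copies of `size` bytes into an existing GDRCopy mapping. */
static double avg_copy_latency_us(gdr_mh_t mh, void *map_ptr,
                                  const void *src, size_t size, int iters)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; ++i)
        gdr_copy_to_mapping(mh, map_ptr, src, size);   /* write-combined stores into the BAR mapping */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double total_us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                      (t1.tv_nsec - t0.tv_nsec) / 1e3;
    return total_us / iters;
}
```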
Maintenance & Community
- Developed and maintained by NVIDIA.
- Bug reporting via NVIDIA Developer site.
Licensing & Compatibility
- License: not explicitly stated in the README; consult the LICENSE file in the repository.
- Compatibility: Requires specific NVIDIA hardware and drivers. Does not work with CUDA managed memory.
Limitations & Caveats
- gdr_map() requires addresses aligned to GPU pages; users must ensure alignment (see the offset-handling sketch after this list).
- Handling memory regions that span across cudaMalloc allocations is not well-supported.
- The proprietary GPU driver flavor may perform suboptimally on coherent platforms and may have issues on Intel CPUs with confidential computing enabled.
- Pinning the same GPU address multiple times may consume excessive BAR1 space on some driver versions.
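To illustrate the alignment and offset handling mentioned above, here is a hedged sketch (the helper name map_device_range is hypothetical): pin the GPU pages covering an arbitrary device address, map them, and use gdr_get_info() to locate the original pointer inside the page-aligned mapping. It assumes g is an open gdr_t handle and d_ptr is a device address obtained from cudaMalloc or cuMemAlloc:

```c
#include <stddef.h>
#include <gdrapi.h>

/* Map `size` bytes starting at device address `d_ptr` (not necessarily
 * GPU-page aligned) and return a CPU pointer to that exact byte range.
 * On success, *mh, *map_base, and *map_size describe the mapping for
 * later gdr_unmap()/gdr_unpin_buffer(); returns NULL on failure. */
static void *map_device_range(gdr_t g, unsigned long d_ptr, size_t size,
                              gdr_mh_t *mh, void **map_base, size_t *map_size)
{
    /* Expand the range to whole GPU pages before pinning and mapping. */
    unsigned long aligned = d_ptr & GPU_PAGE_MASK;
    *map_size = (d_ptr + size - aligned + GPU_PAGE_SIZE - 1) & GPU_PAGE_MASK;

    if (gdr_pin_buffer(g, aligned, *map_size, 0, 0, mh) != 0)
        return NULL;
    if (gdr_map(g, *mh, map_base, *map_size) != 0) {
        gdr_unpin_buffer(g, *mh);
        return NULL;
    }

    /* The mapping starts at a GPU page boundary; gdr_get_info() reports
     * which device virtual address that boundary corresponds to. */
    gdr_info_t info;
    gdr_get_info(g, *mh, &info);
    size_t offset = (size_t)(d_ptr - info.va);
    return (char *)(*map_base) + offset;
}
```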