cudnn-frontend  by NVIDIA

C++ and Python interface for NVIDIA cuDNN and high-performance kernels

Created 5 years ago
829 stars

Top 42.3% on SourcePulse

Project Summary

NVIDIA/cudnn-frontend is a modern, open-source, header-only C++ library and Python interface to the NVIDIA cuDNN library. It simplifies access to cuDNN's Graph API and high-performance kernels, and targets developers optimizing deep learning workloads on NVIDIA hardware. Because the kernels are open-sourced, developers can inspect the core logic and contribute to it.

How It Works

The library provides a Unified Graph API for defining complex computational subgraphs as reusable cudnn_frontend::graph::Graph objects. It abstracts the boilerplate of the backend cuDNN API through simplified C++ and Python bindings (via pybind11). Key advantages include built-in autotuning, support for the latest NVIDIA GPU architectures, and the ability to leverage and contribute to open-sourced, high-performance kernels like optimized GEMM and Native Sparse Attention.
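The graph workflow described above can be sketched in Python roughly as follows. This is a minimal sketch and not the project's official sample: it assumes the pip package is imported as `cudnn`, that an NVIDIA GPU with cuDNN is available, and it uses tensor shapes chosen purely for illustration. The helper `nhwc_strides` and the function name `build_conv_graph` are hypothetical names introduced here.

```python
import importlib.util


def nhwc_strides(dim):
    """Strides for a tensor whose dims are listed as [N, C, H, W] but stored channels-last."""
    _n, c, h, w = dim
    return [h * w * c, 1, w * c, c]


def build_conv_graph():
    """Describe and build a one-operation convolution graph (needs an NVIDIA GPU)."""
    import cudnn  # assumption: import name of the nvidia_cudnn_frontend package

    handle = cudnn.create_handle()
    graph = cudnn.pygraph(
        handle=handle,
        name="conv_sketch",
        io_data_type=cudnn.data_type.HALF,
        compute_data_type=cudnn.data_type.FLOAT,
    )

    # Illustrative shapes only: a batch of 8 images and 32 filters of size 3x3.
    x_dim, w_dim = [8, 64, 56, 56], [32, 64, 3, 3]
    X = graph.tensor(name="X", dim=x_dim, stride=nhwc_strides(x_dim),
                     data_type=cudnn.data_type.HALF)
    W = graph.tensor(name="W", dim=w_dim, stride=nhwc_strides(w_dim),
                     data_type=cudnn.data_type.HALF)

    # Add the convolution node and mark its result as a graph output.
    Y = graph.conv_fprop(image=X, weight=W, padding=[1, 1], stride=[1, 1], dilation=[1, 1])
    Y.set_output(True)

    # Lower the description to the backend and let the heuristics pick plans.
    graph.validate()
    graph.build_operation_graph()
    graph.create_execution_plans([cudnn.heur_mode.A, cudnn.heur_mode.FALLBACK])
    graph.check_support()
    graph.build_plans()
    return graph, (X, W, Y)


if __name__ == "__main__":
    if importlib.util.find_spec("cudnn") is None:
        print("cudnn module not found; install nvidia_cudnn_frontend to run this sketch")
    else:
        graph, _ = build_conv_graph()
        print("workspace size in bytes:", graph.get_workspace_size())
```

Executing the built graph would additionally need device buffers for X, W, and Y plus a workspace of the reported size; those steps are omitted here to keep the sketch short.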

Quick Start & Requirements

  • Python Installation: pip install nvidia_cudnn_frontend
  • C++ Integration: Include the header files; ensure the include path points to the repository's include/ directory.
  • Prerequisites: Python 3.8+, NVIDIA driver, CUDA Toolkit.
  • Build from Source: Requires python-dev and dependencies listed in requirements.txt. Environment variables CUDAToolkit_ROOT and CUDNN_PATH can override default paths.
  • Documentation: Developer Guide, C++ Samples, Python Samples.
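Before building or installing, it can help to check the prerequisites listed above. The following is a minimal sketch using only the Python standard library; it assumes the installed package is importable as `cudnn`, and the function name `check_environment` is hypothetical. It does not verify the NVIDIA driver or the CUDA Toolkit itself, only what Python can see.

```python
import importlib.util
import os
import sys


def check_environment(env=None):
    """Report the Python version, package visibility, and optional path overrides."""
    env = os.environ if env is None else env
    return {
        # The summary lists Python 3.8+ as a prerequisite.
        "python_ok": sys.version_info >= (3, 8),
        # Assumption: the pip package nvidia_cudnn_frontend installs a module named cudnn.
        "cudnn_module_found": importlib.util.find_spec("cudnn") is not None,
        # Optional overrides used when building from source; None means "not set".
        "CUDAToolkit_ROOT": env.get("CUDAToolkit_ROOT"),
        "CUDNN_PATH": env.get("CUDNN_PATH"),
    }


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

Passing a plain dictionary as `env` makes the check easy to exercise without touching the real environment.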

Highlighted Details

  • Open-Source Kernels: Includes implementations for GEMM + Amax (Optimized FP8 matrix multiplication), GEMM + SwiGLU (Fused GEMM with SwiGLU activation), and NSA (Native Sparse Attention).
  • Unified Graph API: Enables creation of reusable, persistent graph objects for complex subgraphs.
  • Performance: Features built-in autotuning and support for the latest NVIDIA GPU architectures. Benchmarks for Scaled Dot-Product Attention (SDPA) on GB200 and GB300 GPUs are available.

Maintenance & Community

Contributions are actively welcomed. The README does not specify community channels (e.g., Discord, Slack) or list notable contributors or sponsorships.

Licensing & Compatibility

Licensed under the MIT License. This permissive license generally allows for commercial use and integration into closed-source projects.

Limitations & Caveats

The README does not explicitly document limitations, alpha status, or known bugs. Building from source is required only for the C++ samples and for locally built Python bindings; the primary integration paths are the header-only C++ API and the pip-installed Python package.

Health Check

  • Last Commit: 22 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 20
  • Issues (30d): 11

Star History

108 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Benjamin Bolte (Cofounder of K-Scale Labs), and 18 more.

ThunderKittens by HazyResearch

0.6%
3k
CUDA kernel framework for fast deep learning primitives
Created 2 years ago
Updated 5 days ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Luis Capelo (Cofounder of Lightning AI), and 8 more.

TransformerEngine by NVIDIA

0.2%
3k
Library for Transformer model acceleration on NVIDIA GPUs
Created 3 years ago
Updated 18 hours ago
Starred by Peter Norvig (Author of "Artificial Intelligence: A Modern Approach"; Research Director at Google), Alexey Milovidov (Cofounder of Clickhouse), and 29 more.

llm.c by karpathy

0.2%
30k
LLM training in pure C/CUDA, no PyTorch needed
Created 2 years ago
Updated 11 months ago