cudnn-frontend by NVIDIA

C++ and Python interface for NVIDIA cuDNN and high-performance kernels

Created 5 years ago
685 stars

Top 49.6% on SourcePulse

Project Summary

NVIDIA/cudnn-frontend offers a modern, open-source, header-only C++ library and a Python interface to the NVIDIA cuDNN library. It simplifies access to cuDNN's Graph API and high-performance kernels, targeting developers who want to optimize deep learning workloads on NVIDIA hardware. Because key kernels are open-sourced, developers can inspect and contribute to the core logic, which improves transparency and customizability.

How It Works

The library provides a Unified Graph API for defining complex computational subgraphs as reusable cudnn_frontend::graph::Graph objects. It abstracts away the boilerplate of the lower-level cuDNN backend API through simplified C++ and Python bindings (the latter via pybind11). Key advantages include built-in autotuning, support for the latest NVIDIA GPU architectures, and the ability to use and contribute to open-sourced, high-performance kernels such as optimized GEMM and Native Sparse Attention.
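To illustrate the build-once, execute-many pattern behind a graph API, here is a deliberately generic Python sketch. It is NOT the real cudnn_frontend API (whose exact class and method names should be checked against the Python samples); it only models the idea of defining a subgraph of fused operations once and then executing it repeatedly with different inputs.

```python
# Conceptual sketch of a build-once, execute-many operation graph.
# This mimics the *shape* of a graph API; it is not the actual
# cudnn_frontend::graph::Graph interface.

class Node:
    def __init__(self, op, inputs):
        self.op = op          # callable computing this node's value
        self.inputs = inputs  # upstream Node objects or input names (str)

class Graph:
    def __init__(self):
        self.nodes = []

    def add(self, op, inputs):
        node = Node(op, inputs)
        self.nodes.append(node)
        return node

    def execute(self, feeds):
        # Nodes are evaluated in insertion order, which is already a
        # topological order because inputs must exist before being used.
        values = {}
        for node in self.nodes:
            args = [feeds[i] if isinstance(i, str) else values[id(i)]
                    for i in node.inputs]
            values[id(node)] = node.op(*args)
        return values[id(self.nodes[-1])]

# Build the graph once: y = relu(a * b + c)
g = Graph()
mul = g.add(lambda a, b: a * b, ["a", "b"])
add = g.add(lambda x, c: x + c, [mul, "c"])
out = g.add(lambda x: max(x, 0.0), [add])

# Execute it many times with different inputs.
print(g.execute({"a": 2.0, "b": 3.0, "c": -1.0}))  # 5.0
```

The real library follows the same lifecycle (define tensors and ops, validate/build the graph, then execute with varying device buffers), with cuDNN selecting fused kernels for the subgraph.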

Quick Start & Requirements

  • Python Installation: pip install nvidia_cudnn_frontend
  • C++ Integration: Include the header files; ensure the include path points to the repository's include/ directory.
  • Prerequisites: Python 3.8+, NVIDIA driver, CUDA Toolkit.
  • Build from Source: Requires python-dev and dependencies listed in requirements.txt. Environment variables CUDAToolkit_ROOT and CUDNN_PATH can override default paths.
  • Documentation: Developer Guide, C++ Samples, Python Samples.
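The install and override steps above can be sketched as shell commands. The package name and environment-variable names come from the README; the paths below are placeholders, not required locations.

```shell
# Install the Python bindings from PyPI.
pip install nvidia_cudnn_frontend

# When building from source, these variables can point the build at
# non-default CUDA / cuDNN installations (example paths only).
export CUDAToolkit_ROOT=/usr/local/cuda
export CUDNN_PATH=/usr/local/cudnn
```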

Highlighted Details

  • Open-Source Kernels: Includes implementations for GEMM + Amax (Optimized FP8 matrix multiplication), GEMM + SwiGLU (Fused GEMM with SwiGLU activation), and NSA (Native Sparse Attention).
  • Unified Graph API: Enables creation of reusable, persistent graph objects for complex subgraphs.
  • Performance: Features built-in autotuning and support for the latest NVIDIA GPU architectures. Benchmarks for Scaled Dot-Product Attention (SDPA) on GB200 and GB300 GPUs are available.
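Autotuning in this style of library generally means timing several candidate execution plans for the same graph and caching the fastest. A minimal, library-agnostic sketch of that idea (the candidate functions are hypothetical stand-ins, not cuDNN execution plans):

```python
import time

def autotune(candidates, args, warmup=1, iters=5):
    """Time each candidate callable on the same inputs; return the fastest."""
    best, best_t = None, float("inf")
    for fn in candidates:
        for _ in range(warmup):   # warm caches before timing
            fn(*args)
        t0 = time.perf_counter()
        for _ in range(iters):
            fn(*args)
        elapsed = (time.perf_counter() - t0) / iters
        if elapsed < best_t:
            best, best_t = fn, elapsed
    return best

# Two equivalent ways to compute a sum of squares; autotune picks
# whichever happens to run faster on this machine.
def loop_sum(xs):
    total = 0
    for x in xs:
        total += x * x
    return total

def builtin_sum(xs):
    return sum(x * x for x in xs)

data = list(range(10_000))
fastest = autotune([loop_sum, builtin_sum], (data,))
print(fastest(data) == loop_sum(data))  # candidates agree on the result
```

In the real library the candidates are cuDNN execution plans for a built graph, and the chosen plan can be reused across executions.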

Maintenance & Community

Contributions are actively welcomed. The README does not specify community channels (e.g., Discord, Slack) or list notable contributors or sponsorships.

Licensing & Compatibility

Licensed under the MIT License. This permissive license generally allows for commercial use and integration into closed-source projects.

Limitations & Caveats

The provided README does not explicitly detail limitations, alpha status, or known bugs. Building from source is required for C++ samples and Python bindings, suggesting a focus on integration via the header-only C++ API or the pip-installed Python package.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 3
  • Issues (30d): 1
  • Star History: 10 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Benjamin Bolte (Cofounder of K-Scale Labs), and 18 more.

ThunderKittens by HazyResearch

CUDA kernel framework for fast deep learning primitives
1.3% · 3k stars · Created 2 years ago · Updated 1 day ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Luis Capelo (Cofounder of Lightning AI), and 8 more.

TransformerEngine by NVIDIA

Library for Transformer model acceleration on NVIDIA GPUs
0.3% · 3k stars · Created 3 years ago · Updated 20 hours ago
Starred by Peter Norvig (Author of "Artificial Intelligence: A Modern Approach"; Research Director at Google), Alexey Milovidov (Cofounder of Clickhouse), and 29 more.

llm.c by karpathy

LLM training in pure C/CUDA, no PyTorch needed
0.1% · 29k stars · Created 1 year ago · Updated 8 months ago