cutlass by NVIDIA

CUDA C++ and Python DSLs for high-performance linear algebra

Created 8 years ago
8,869 stars

Top 5.8% on SourcePulse

Project Summary

CUTLASS provides a comprehensive suite of CUDA C++ template abstractions and a new Python DSL (CuTe) for implementing high-performance matrix-matrix multiplication (GEMM) and related linear algebra computations. It targets researchers, performance engineers, and power users who require optimized GPU kernels, offering a flexible, modular approach to harness the full potential of NVIDIA GPUs across various architectures and data types. The project aims to simplify the development of efficient GPU kernels, enabling faster prototyping and integration with deep learning frameworks.

How It Works

CUTLASS employs a strategy of hierarchical decomposition and data movement abstractions within CUDA. Its core C++ template library allows for fine-grained customization of tiling, data types, and algorithmic policies. The recent addition of the CuTe DSL provides Python-native interfaces, abstracting away C++ complexities and enabling rapid kernel design and metaprogramming. This approach facilitates direct integration with DL frameworks and significantly reduces compile times compared to pure C++ template instantiation.
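The hierarchical decomposition strategy can be illustrated with a plain-Python tiling sketch. This is conceptual pseudocode of the idea, not CuTe DSL or CUTLASS API code, and the tile sizes are illustrative: the output matrix is split into tiles (mirroring the threadblock level), and each tile accumulates over slices of the K dimension (mirroring the mainloop).

```python
# Conceptual sketch of CUTLASS-style hierarchical GEMM tiling in plain
# Python (NOT the CuTe DSL API). Tile sizes are illustrative only.

def tiled_gemm(A, B, M, N, K, tile_m=4, tile_n=4, tile_k=2):
    """Compute C = A @ B tile by tile, mimicking threadblock decomposition."""
    C = [[0.0] * N for _ in range(M)]
    for m0 in range(0, M, tile_m):          # tile rows (threadblock level)
        for n0 in range(0, N, tile_n):      # tile cols (threadblock level)
            for k0 in range(0, K, tile_k):  # mainloop over K slices
                for m in range(m0, min(m0 + tile_m, M)):
                    for n in range(n0, min(n0 + tile_n, N)):
                        for k in range(k0, min(k0 + tile_k, K)):
                            C[m][n] += A[m][k] * B[k][n]
    return C
```

In the real library, each level of this loop nest maps onto a hardware resource (threadblocks, warps, Tensor Core instructions), and the CuTe layout algebra handles the index arithmetic.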

Quick Start & Requirements

CUTLASS is a header-only library; client applications should target its include/ directory. Building tests and utilities requires CMake.

  • Primary Install: Include CUTLASS headers in your project's include paths.
  • Prerequisites: C++17 compliant host compiler (GCC >= 9 recommended), CUDA Toolkit >= 11.4 (12.8 recommended). Tested on Ubuntu 18.04/20.04/22.04 with GCC.
  • Hardware: NVIDIA GPUs with compute capability 7.0 (Volta) or newer are supported and expected to run efficiently.
  • Docs: CUTLASS C++ Quick Start Guide, CuTe DSL Quick Start Guide.
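For reference, building the tests and profiler from source follows a standard CMake flow as described in the Quick Start Guide. The sketch below assumes a Hopper GPU; the architecture flag should match your hardware.

```shell
git clone https://github.com/NVIDIA/cutlass.git
mkdir -p cutlass/build && cd cutlass/build
# CUTLASS_NVCC_ARCHS selects the target compute capabilities;
# 90a assumes a Hopper GPU -- adjust for your architecture.
cmake .. -DCUTLASS_NVCC_ARCHS=90a
make cutlass_profiler -j
```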

Highlighted Details

  • Supports a wide range of data types including FP64, FP32, TF32, FP16, BF16, FP32 emulation, 8-bit floating point (e5m2, e4m3), block-scaled (MXFP4, MXFP6, MXFP8), narrow integers (4/8-bit), and binary (1-bit) types.
  • The CuTe DSL offers Python interfaces for high-performance CUDA kernel development, targeting Tensor Cores on Ampere, Hopper, and Blackwell architectures.
  • GEMM kernels achieve a high fraction of theoretical peak throughput, with performance gains demonstrated on NVIDIA H100 (Hopper) and Blackwell architectures.
  • Includes a command-line profiler for benchmarking and analyzing CUTLASS kernels.
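To make the 8-bit floating-point formats listed above concrete, here is a minimal sketch of a decoder for the OCP e4m3 encoding (1 sign bit, 4 exponent bits, 3 mantissa bits, exponent bias 7). This is a reference illustration of the format itself, not CUTLASS code:

```python
def decode_e4m3(byte: int) -> float:
    """Decode an OCP FP8 e4m3 value (1 sign, 4 exp, 3 mantissa bits, bias 7).

    e4m3 has no infinities; the all-ones exponent + mantissa pattern is NaN.
    """
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0x0F
    man = byte & 0x07
    if exp == 0x0F and man == 0x07:   # 0x7F / 0xFF encode NaN
        return float("nan")
    if exp == 0:                      # subnormal: no implicit leading 1
        return sign * (man / 8.0) * 2.0 ** -6
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)
```

The tradeoff against e5m2 is visible in the bias and field widths: e4m3 spends more bits on precision (max normal value 448), while e5m2 spends them on dynamic range.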

Maintenance & Community

CUTLASS is developed and released by NVIDIA Corporation. A list of contributors is available in the CONTRIBUTORS file.

Licensing & Compatibility

Released under the permissive 3-clause "New" BSD license, allowing for commercial use and integration into closed-source projects.

Limitations & Caveats

CUTLASS 4.x builds are known to be non-functional on Windows for all CUDA toolkits; the CUTLASS team is actively working on a fix. The CuTe DSL is currently in public beta and is expected to exit beta by the end of summer 2025. Kernels compiled with architecture-accelerated targets (e.g., sm_90a, sm_100a) may not be forward-compatible with future architectures or portable across GPU variants (e.g., Blackwell SM100 vs. the RTX 50 series).

Health Check
Last Commit

2 days ago

Responsiveness

Inactive

Pull Requests (30d)
39
Issues (30d)
65
Star History
176 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Vincent Weisser (Cofounder of Prime Intellect), and 17 more.

ThunderKittens by HazyResearch

0.5%
3k
CUDA kernel framework for fast deep learning primitives
Created 1 year ago
Updated 2 days ago
Starred by Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).

fastllm by ztxz16

0.2%
4k
High-performance C++ LLM inference library
Created 2 years ago
Updated 2 days ago
Starred by David Cournapeau (Author of scikit-learn), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 5 more.

lectures by gpu-mode

0.7%
5k
Lecture series for GPU-accelerated computing
Created 1 year ago
Updated 1 week ago
Starred by Nathan Lambert (Research Scientist at AI2), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 7 more.

DeepGEMM by deepseek-ai

0.2%
6k
CUDA library for efficient FP8 GEMM kernels with fine-grained scaling
Created 9 months ago
Updated 5 days ago