cutlass by NVIDIA

CUDA C++ and Python DSLs for high-performance linear algebra

Created 8 years ago
8,869 stars

Top 5.8% on SourcePulse

Project Summary

CUTLASS provides a comprehensive suite of CUDA C++ template abstractions and a new Python DSL (CuTe) for implementing high-performance matrix-matrix multiplication (GEMM) and related linear algebra computations. It targets researchers, performance engineers, and power users who require optimized GPU kernels, offering a flexible, modular approach to harness the full potential of NVIDIA GPUs across various architectures and data types. The project aims to simplify the development of efficient GPU kernels, enabling faster prototyping and integration with deep learning frameworks.

How It Works

CUTLASS employs a strategy of hierarchical decomposition and data movement abstractions within CUDA. Its core C++ template library allows for fine-grained customization of tiling, data types, and algorithmic policies. The recent addition of the CuTe DSL provides Python-native interfaces, abstracting away C++ complexities and enabling rapid kernel design and metaprogramming. This approach facilitates direct integration with DL frameworks and significantly reduces compile times compared to pure C++ template instantiation.
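The hierarchical decomposition strategy can be illustrated with a plain-Python tiling sketch. This is conceptual pseudocode of the idea, not CuTe DSL or CUTLASS API code, and the tile sizes are illustrative: the output matrix is split into tiles (mirroring the threadblock level), and each tile accumulates over slices of the K dimension (mirroring the mainloop).

```python
# Conceptual sketch of CUTLASS-style hierarchical GEMM tiling in plain
# Python (NOT the CuTe DSL API). Tile sizes are illustrative only.

def tiled_gemm(A, B, M, N, K, tile_m=4, tile_n=4, tile_k=2):
    """Compute C = A @ B tile by tile, mimicking threadblock decomposition."""
    C = [[0.0] * N for _ in range(M)]
    for m0 in range(0, M, tile_m):          # tile rows (threadblock level)
        for n0 in range(0, N, tile_n):      # tile cols (threadblock level)
            for k0 in range(0, K, tile_k):  # mainloop over K slices
                for m in range(m0, min(m0 + tile_m, M)):
                    for n in range(n0, min(n0 + tile_n, N)):
                        for k in range(k0, min(k0 + tile_k, K)):
                            C[m][n] += A[m][k] * B[k][n]
    return C
```

In the real library, each level of this loop nest maps onto a hardware resource (threadblocks, warps, Tensor Core instructions), and the CuTe layout algebra handles the index arithmetic.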

Quick Start & Requirements

CUTLASS is a header-only library; client applications should target its include/ directory. Building tests and utilities requires CMake.

  • Primary Install: Include CUTLASS headers in your project's include paths.
  • Prerequisites: C++17 compliant host compiler (GCC >= 9 recommended), CUDA Toolkit >= 11.4 (12.8 recommended). Tested on Ubuntu 18.04/20.04/22.04 with GCC.
  • Hardware: NVIDIA GPUs with compute capability 7.0 (Volta) or newer are supported and expected to run efficiently.
  • Docs: CUTLASS C++ Quick Start Guide, CuTe DSL Quick Start Guide.
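For reference, building the tests and profiler from source follows a standard CMake flow as described in the Quick Start Guide. The sketch below assumes a Hopper GPU; the architecture flag should match your hardware.

```shell
git clone https://github.com/NVIDIA/cutlass.git
mkdir -p cutlass/build && cd cutlass/build
# CUTLASS_NVCC_ARCHS selects the target compute capabilities;
# 90a assumes a Hopper GPU -- adjust for your architecture.
cmake .. -DCUTLASS_NVCC_ARCHS=90a
make cutlass_profiler -j
```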

Highlighted Details

  • Supports a wide range of data types including FP64, FP32, TF32, FP16, BF16, FP32 emulation, 8-bit floating point (e5m2, e4m3), block-scaled (MXFP4, MXFP6, MXFP8), narrow integers (4/8-bit), and binary (1-bit) types.
  • The CuTe DSL offers Python interfaces for high-performance CUDA kernel development, targeting Tensor Cores on Ampere, Hopper, and Blackwell architectures.
  • GEMM kernels achieve a high fraction of theoretical peak throughput, with performance gains demonstrated on NVIDIA H100 (Hopper) and Blackwell architectures.
  • Includes a command-line profiler for benchmarking and analyzing CUTLASS kernels.
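To make the 8-bit floating-point formats listed above concrete, here is a minimal sketch of a decoder for the OCP e4m3 encoding (1 sign bit, 4 exponent bits, 3 mantissa bits, exponent bias 7). This is a reference illustration of the format itself, not CUTLASS code:

```python
def decode_e4m3(byte: int) -> float:
    """Decode an OCP FP8 e4m3 value (1 sign, 4 exp, 3 mantissa bits, bias 7).

    e4m3 has no infinities; the all-ones exponent + mantissa pattern is NaN.
    """
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0x0F
    man = byte & 0x07
    if exp == 0x0F and man == 0x07:   # 0x7F / 0xFF encode NaN
        return float("nan")
    if exp == 0:                      # subnormal: no implicit leading 1
        return sign * (man / 8.0) * 2.0 ** -6
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)
```

The tradeoff against e5m2 is visible in the bias and field widths: e4m3 spends more bits on precision (max normal value 448), while e5m2 spends them on dynamic range.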

Maintenance & Community

CUTLASS is developed and released by NVIDIA Corporation. A list of contributors is available in the CONTRIBUTORS file.

Licensing & Compatibility

Released under the permissive 3-clause "New" BSD license, allowing for commercial use and integration into closed-source projects.

Limitations & Caveats

CUTLASS 4.x builds are known to be non-functional on Windows for all CUDA toolkits; the CUTLASS team is actively working on a fix. The CuTe DSL is currently in public beta and is expected to exit beta by the end of summer 2025. Kernels compiled with architecture-accelerated targets (e.g., sm_90a, sm_100a) may not be forward-compatible with future architectures or portable across GPU variants (e.g., Blackwell SM100 vs. the RTX 50 series).

Health Check
Last Commit

2 days ago

Responsiveness

Inactive

Pull Requests (30d)
39
Issues (30d)
65
Star History
176 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Vincent Weisser (Cofounder of Prime Intellect), and 17 more.

ThunderKittens by HazyResearch

0.5%
3k
CUDA kernel framework for fast deep learning primitives
Created 1 year ago
Updated 2 days ago
Starred by Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).

fastllm by ztxz16

0.2%
4k
High-performance C++ LLM inference library
Created 2 years ago
Updated 2 days ago
Starred by David Cournapeau (Author of scikit-learn), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 5 more.

lectures by gpu-mode

0.7%
5k
Lecture series for GPU-accelerated computing
Created 1 year ago
Updated 1 week ago
Starred by Nathan Lambert (Research Scientist at AI2), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 7 more.

DeepGEMM by deepseek-ai

0.2%
6k
CUDA library for efficient FP8 GEMM kernels with fine-grained scaling
Created 9 months ago
Updated 5 days ago