cutlass-kernels by ColfaxResearch

High-performance LLM kernels library

Created 2 years ago

261 stars

Top 97.3% on SourcePulse

Project Summary

This repository offers a library of CUTLASS kernels specifically optimized for Large Language Models (LLMs). It serves as a supplementary resource for developers and researchers working with LLMs, providing experimental kernel variants, potentially including those related to FlashAttention-3, to explore and implement performance enhancements in GPU-accelerated computations.

How It Works

The project leverages the CUTLASS library, a high-performance CUDA C++ template meta-programming library for linear algebra. It provides building blocks for implementing efficient matrix multiplication (GEMM) and related operations. These kernels are tailored for LLM workloads, aiming to maximize GPU throughput and memory bandwidth utilization through advanced techniques like kernel fusion and optimized data layouts.

Quick Start & Requirements

Installation: First, download and build the CUTLASS library following instructions from its official repository (https://github.com/NVIDIA/cutlass).
Compilation: Modify the compile.sh script within this repository to specify the correct path to your CUTLASS installation. Then, execute the script (./compile.sh).
Execution: When running the compiled executable, ensure the environment variable NVIDIA_TF32_OVERRIDE=1 is set to enable TF32 computation mode for cuBLAS SGEMM operations; otherwise, cuBLAS defaults to float32.
Prerequisites: A CUDA-enabled GPU environment and the CUTLASS library are required.

Highlighted Details

Features experimental kernel variants, potentially including those derived from FlashAttention-3.
Focuses on optimizing performance for Large Language Model (LLM) inference and training workloads.

Maintenance & Community

The official development and maintenance for FlashAttention-3 kernels are located at https://github.com/Dao-AILab/flash-attention.
This repository is intended for experimental purposes and may not receive the same level of support or updates.

Licensing & Compatibility

The license type and any compatibility notes for commercial use are not specified in the provided README.

Limitations & Caveats

This repository hosts experimental variants and explicitly states it does not guarantee the same level of support as the official FlashAttention-3 project.
The build process requires manual modification of hardcoded paths within the compile.sh script.
Users should refer to sub-directory READMEs for more specific instructions, as indicated in the main README.

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Pull Requests (30d)

Issues (30d)

Star History

3 stars in the last 30 days