cutlass-kernels by ColfaxResearch

High-performance LLM kernels library

Created 2 years ago
255 stars

Top 98.8% on SourcePulse

View on GitHub
Project Summary

This repository provides a library of CUTLASS kernels optimized for Large Language Models (LLMs). It is a supplementary resource for developers and researchers working with LLMs, offering experimental kernel variants, potentially including those related to FlashAttention-3, as a basis for exploring performance improvements in GPU-accelerated computation.

How It Works

The project builds on CUTLASS, NVIDIA's high-performance CUDA C++ template metaprogramming library for linear algebra, which provides the building blocks for efficient matrix multiplication (GEMM) and related operations. The kernels here are tailored to LLM workloads, aiming to maximize GPU throughput and memory-bandwidth utilization through techniques such as kernel fusion and optimized data layouts.
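
As a point of reference for the kind of building block CUTLASS exposes, the sketch below instantiates a basic single-precision device-level GEMM. It is a minimal sketch, not code from this repository; the element types, layouts, and function name are illustrative assumptions.

    // Minimal sketch of a CUTLASS device-level GEMM: D = alpha * A * B + beta * C.
    // Not from this repository; element types and layouts are assumed for illustration.
    #include "cutlass/gemm/device/gemm.h"

    using ColumnMajor = cutlass::layout::ColumnMajor;
    using CutlassGemm = cutlass::gemm::device::Gemm<float, ColumnMajor,   // A: element type, layout
                                                    float, ColumnMajor,   // B: element type, layout
                                                    float, ColumnMajor>;  // C/D: element type, layout

    // A, B, and C must point to device memory.
    cutlass::Status run_sgemm(int M, int N, int K, float alpha,
                              float const* A, int lda,
                              float const* B, int ldb,
                              float beta, float* C, int ldc) {
      CutlassGemm gemm_op;
      CutlassGemm::Arguments args({M, N, K},      // problem size
                                  {A, lda},       // tensor ref for A
                                  {B, ldb},       // tensor ref for B
                                  {C, ldc},       // tensor ref for C (source)
                                  {C, ldc},       // tensor ref for D (destination)
                                  {alpha, beta}); // epilogue scalars
      return gemm_op(args);                       // launches the kernel on the default stream
    }

Kernels specialized for LLM workloads layer fusion and custom data layouts on top of primitives like this to reduce round trips to global memory.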

Quick Start & Requirements

  • Installation: First, download and build the CUTLASS library following instructions from its official repository (https://github.com/NVIDIA/cutlass).
  • Compilation: Modify the compile.sh script within this repository to point at your CUTLASS installation, then execute it (./compile.sh); a command sketch follows this list.
  • Execution: When running the compiled executable, set the environment variable NVIDIA_TF32_OVERRIDE=1 to enable TF32 computation for cuBLAS SGEMM; otherwise cuBLAS defaults to full float32 precision.
  • Prerequisites: A CUDA-enabled GPU environment and the CUTLASS library are required.
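
Putting the steps above together, a possible command sequence looks like the following; the binary name run_kernel is hypothetical, since compile.sh in this repository determines the actual output names.

    # Sketch of the build-and-run flow described above.
    git clone https://github.com/NVIDIA/cutlass.git   # obtain CUTLASS; build it per its own README
    # Edit compile.sh so it points at the CUTLASS installation path, then:
    ./compile.sh                                       # build the kernels in this repository
    NVIDIA_TF32_OVERRIDE=1 ./run_kernel                # "run_kernel" is a hypothetical binary name;
                                                       # the env var enables TF32 for cuBLAS SGEMM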

Highlighted Details

  • Features experimental kernel variants, potentially including those derived from FlashAttention-3.
  • Focuses on optimizing performance for Large Language Model (LLM) inference and training workloads.

Maintenance & Community

  • FlashAttention-3 kernels are officially developed and maintained at https://github.com/Dao-AILab/flash-attention.
  • This repository is intended for experimental purposes and may not receive the same level of support or updates.

Licensing & Compatibility

  • The license type and any compatibility notes for commercial use are not specified in the provided README.

Limitations & Caveats

  • This repository hosts experimental variants and explicitly states it does not guarantee the same level of support as the official FlashAttention-3 project.
  • The build process requires manual modification of hardcoded paths within the compile.sh script.
  • Users should refer to sub-directory READMEs for more specific instructions, as indicated in the main README.
Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 6 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

  • rtp-llm by alibaba (0.7%, 995 stars): LLM inference engine for diverse applications. Created 2 years ago, updated 16 hours ago. Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 12 more.

  • Liger-Kernel by linkedin (0.5%, 6k stars): Triton kernels for efficient LLM training. Created 1 year ago, updated 4 days ago. Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Eric Zhang (Founding Engineer at Modal), and 9 more.

  • DeepGEMM by deepseek-ai (0.4%, 6k stars): CUDA library for efficient FP8 GEMM kernels with fine-grained scaling. Created 11 months ago, updated 5 days ago. Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jeff Hammerbacher (Cofounder of Cloudera), and 9 more.

  • FlashMLA by deepseek-ai (0.1%, 12k stars): Efficient CUDA kernels for MLA decoding. Created 10 months ago, updated 3 weeks ago. Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

  • flash-attention by Dao-AILab (0.6%, 22k stars): Fast, memory-efficient attention implementation. Created 3 years ago, updated 1 day ago.