ColfaxResearch: High-performance LLM kernels library
Top 98.8% on SourcePulse
This repository offers a library of CUTLASS kernels optimized for Large Language Models (LLMs). It serves as a supplementary resource for developers and researchers, providing experimental kernel variants (potentially including variants related to FlashAttention-3) for exploring performance optimizations in GPU-accelerated computation.
How It Works
The project builds on CUTLASS, NVIDIA's high-performance CUDA C++ template library for linear algebra, which supplies the building blocks for efficient matrix multiplication (GEMM) and related operations. The kernels here are tailored to LLM workloads, aiming to maximize GPU throughput and memory-bandwidth utilization through techniques such as kernel fusion and optimized data layouts.
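As a concrete illustration of the kind of building block involved (a minimal sketch, not code from this repository), the example below runs a single-precision GEMM through CUTLASS's device-level API; the matrix sizes, column-major layouts, and default tiling are arbitrary choices for the sketch:

```cpp
#include <cuda_runtime.h>
#include <cutlass/gemm/device/gemm.h>

int main() {
  // Illustrative problem size; real LLM kernels tune tile shapes per workload.
  int M = 512, N = 512, K = 512;
  float alpha = 1.0f, beta = 0.0f;

  float *A, *B, *C;
  cudaMalloc(&A, sizeof(float) * M * K);
  cudaMalloc(&B, sizeof(float) * K * N);
  cudaMalloc(&C, sizeof(float) * M * N);

  // Single-precision GEMM with column-major operands, using CUTLASS's
  // default threadblock/warp tiling for the target architecture.
  using ColumnMajor = cutlass::layout::ColumnMajor;
  using Gemm = cutlass::gemm::device::Gemm<float, ColumnMajor,   // A
                                           float, ColumnMajor,   // B
                                           float, ColumnMajor>;  // C

  Gemm gemm_op;
  Gemm::Arguments args({M, N, K},       // problem size
                       {A, M},          // A with leading dimension M
                       {B, K},          // B with leading dimension K
                       {C, M},          // C (input for beta scaling)
                       {C, M},          // D (output)
                       {alpha, beta});  // epilogue: D = alpha*A*B + beta*C
  cutlass::Status status = gemm_op(args);

  cudaFree(A); cudaFree(B); cudaFree(C);
  return status == cutlass::Status::kSuccess ? 0 : 1;
}
```

Kernel fusion in this setting typically means folding elementwise epilogues (bias, activation, scaling) into the GEMM's output stage rather than launching separate kernels, saving round trips to global memory.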
Quick Start & Requirements
Requires the CUTLASS library (https://github.com/NVIDIA/cutlass). Edit the compile.sh script in this repository to point to your CUTLASS installation, then run it (./compile.sh). Note that NVIDIA_TF32_OVERRIDE=1 is set to enable TF32 computation mode for cuBLAS SGEMM operations; otherwise, cuBLAS defaults to float32.
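As a sketch of what that mode means (not code from this repository; the matrix size is an arbitrary placeholder), TF32 Tensor Core math can also be requested programmatically per cuBLAS handle, the in-code counterpart to the environment variable:

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
  const int N = 1024;
  float *A, *B, *C;
  cudaMalloc(&A, sizeof(float) * N * N);
  cudaMalloc(&B, sizeof(float) * N * N);
  cudaMalloc(&C, sizeof(float) * N * N);

  cublasHandle_t handle;
  cublasCreate(&handle);
  // Opt in to TF32 Tensor Core math for FP32 GEMMs (cuBLAS 11+), the
  // computation mode the NVIDIA_TF32_OVERRIDE note above refers to.
  cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);

  const float alpha = 1.0f, beta = 0.0f;
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
              N, N, N, &alpha, A, N, B, N, &beta, C, N);

  cublasDestroy(handle);
  cudaFree(A); cudaFree(B); cudaFree(C);
  return 0;
}
```

TF32 keeps FP32 dynamic range but rounds mantissas to 10 bits inside Tensor Core GEMMs, which is why enabling it changes both the speed and the numerics of SGEMM benchmarks.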
Highlighted Details
Maintenance & Community
Related upstream project: https://github.com/Dao-AILab/flash-attention. The repository was last updated about a year ago and is currently marked inactive.
Licensing & Compatibility
Limitations & Caveats
The CUTLASS path must be configured manually in the compile.sh script.