GEMM kernels for single/half precision
This repository provides optimized implementations of general matrix multiplication (GEMM) for single-precision (FP32) and half-precision (FP16) floating-point numbers. It targets researchers and engineers in deep learning or high-performance computing who need faster matrix operations, particularly for small minibatch sizes and FP16 computation, and aims to outperform standard libraries such as cuBLAS in those scenarios.
How It Works
The project implements custom CUDA kernels for GEMM operations. The primary advantage stems from specialized kernel designs that are more efficient for smaller matrix dimensions and FP16 data types compared to general-purpose libraries like cuBLAS. This optimization is achieved through careful memory access patterns, thread block configurations, and instruction selection tailored to the target hardware architecture.
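For illustration only, a minimal shared-memory-tiled FP32 GEMM kernel in CUDA is sketched below. This is not the repository's actual kernel; the kernel name `sgemm_tiled`, the tile size, and the row-major layout are assumptions made for the sketch, which shows the general tiling and memory-access pattern such kernels build on.

```cuda
// Minimal shared-memory-tiled FP32 GEMM sketch (illustrative only, not the repo's kernel).
// Computes C = A * B for row-major M x K and K x N matrices; TILE is an assumed tile size.
#define TILE 16

__global__ void sgemm_tiled(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;  // row of C computed by this thread
    int col = blockIdx.x * TILE + threadIdx.x;  // column of C computed by this thread
    float acc = 0.0f;

    // Walk the K dimension one tile at a time, staging tiles of A and B in shared memory.
    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < M && col < N)
        C[row * N + col] = acc;
}
```

Production kernels of this kind typically go further, with tuned tile shapes, register blocking, vectorized or FP16 loads, and per-architecture instruction choices, which is where the gains over a general-purpose library come from.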
Quick Start & Requirements
Run ./benchmark.py or ./test.py.
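For orientation, a hedged sketch of how a kernel like the one above might be launched and timed from CUDA host code follows; the repository's actual benchmarking is driven by ./benchmark.py, and the problem size, the use of managed memory, and the `sgemm_tiled` name are assumptions carried over from the sketch in "How It Works" (assumed to be in the same translation unit).

```cuda
// Illustrative host-side launch and timing for the sgemm_tiled sketch above.
// Not the repository's benchmark harness; that is ./benchmark.py.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int M = 512, N = 512, K = 512;  // assumed problem size
    float *A, *B, *C;
    cudaMallocManaged(&A, M * K * sizeof(float));
    cudaMallocManaged(&B, K * N * sizeof(float));
    cudaMallocManaged(&C, M * N * sizeof(float));
    for (int i = 0; i < M * K; ++i) A[i] = 1.0f;
    for (int i = 0; i < K * N; ++i) B[i] = 1.0f;

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);

    // Time a single kernel launch with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    sgemm_tiled<<<grid, block>>>(A, B, C, M, N, K);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("C[0] = %f, time = %.3f ms\n", C[0], ms);  // expect C[0] == K for all-ones inputs

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```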
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats