openai-gemm by openai

GEMM kernels for single/half precision

Created 9 years ago
384 stars

Top 74.4% on SourcePulse

View on GitHub
Project Summary

This repository provides optimized CUDA implementations of general matrix multiplication (GEMM) for single-precision (FP32) and half-precision (FP16) floating-point numbers. It targets researchers and engineers in deep learning and high-performance computing who need faster matrix operations, particularly for small minibatch sizes and FP16 computation; the kernels aim to outperform standard libraries such as cuBLAS in those scenarios.

How It Works

The project implements custom CUDA kernels for GEMM operations. The primary advantage stems from specialized kernel designs that are more efficient for smaller matrix dimensions and FP16 data types compared to general-purpose libraries like cuBLAS. This optimization is achieved through careful memory access patterns, thread block configurations, and instruction selection tailored to the target hardware architecture.
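
As a rough, generic illustration of the underlying technique (not the project's actual kernels), the sketch below shows a shared-memory-tiled FP32 GEMM kernel computing C = alpha*A*B + beta*C. The fixed 32x32 tile, row-major layouts, and the absence of register blocking, double buffering, and vectorized loads are simplifying assumptions; the openai-gemm kernels are hand-tuned well beyond this level.

    // Minimal shared-memory-tiled SGEMM sketch (illustrative only):
    // C = alpha * A * B + beta * C, with A (M x K), B (K x N), C (M x N), all row-major.
    // Launch with block = dim3(TILE, TILE), grid = dim3(ceil(N/TILE), ceil(M/TILE)).
    #define TILE 32

    __global__ void sgemm_tiled(int M, int N, int K,
                                float alpha, const float *A, const float *B,
                                float beta, float *C)
    {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;   // row of C owned by this thread
        int col = blockIdx.x * TILE + threadIdx.x;   // column of C owned by this thread
        float acc = 0.0f;

        for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
            // Cooperatively stage one tile of A and one tile of B in shared memory.
            int a_col = t * TILE + threadIdx.x;
            int b_row = t * TILE + threadIdx.y;
            As[threadIdx.y][threadIdx.x] = (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
            Bs[threadIdx.y][threadIdx.x] = (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
            __syncthreads();

            // Multiply the staged tiles; each thread accumulates one element of C.
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }

        if (row < M && col < N)
            C[row * N + col] = alpha * acc + beta * C[row * N + col];
    }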

Quick Start & Requirements

  • Install: Clone the repository and run ./benchmark.py or ./test.py.
  • Prerequisites: Requires the Nervana neon framework for demonstration code.
  • Hardware: Primarily targets NVIDIA GPUs (benchmarks shown on Pascal TITAN X and P100).

Highlighted Details

  • Offers significant speedups over cuBLAS for small minibatch sizes and FP16 operations.
  • Benchmarks show 10-17x FP16 speedups for certain small matrix dimensions.
  • Notes that the OpenAI kernels do not use fp16x2 instructions, and that cuBLAS's FP16 path is less efficient for small dimensions.
  • Mentions potential accuracy concerns when FP16 accumulation is used for large reductions (illustrated by the sketch after this list).
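
To make the accumulation caveat concrete, here is a generic device-side sketch (not openai-gemm code; the function names are hypothetical): summing many FP16 products directly in an FP16 accumulator rounds at every step, so reductions over a large K dimension are usually accumulated in FP32 even when inputs and outputs are FP16.

    // Generic illustration of the FP16 accumulation caveat (not project code).
    // __hadd/__hmul require a GPU with native FP16 arithmetic (compute capability 5.3+).
    #include <cuda_fp16.h>

    // Accumulate entirely in FP16: every add rounds to ~11 mantissa bits,
    // so error grows with the reduction length K.
    __device__ __half dot_fp16_accumulate_fp16(const __half *a, const __half *b, int K)
    {
        __half acc = __float2half(0.0f);
        for (int k = 0; k < K; ++k)
            acc = __hadd(acc, __hmul(a[k], b[k]));
        return acc;
    }

    // Accumulate in FP32 and round once at the end: much closer to the FP32 result
    // for large reductions, which is why FP16 GEMMs often keep an FP32 accumulator.
    __device__ __half dot_fp16_accumulate_fp32(const __half *a, const __half *b, int K)
    {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += __half2float(a[k]) * __half2float(b[k]);
        return __float2half(acc);
    }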

Maintenance & Community

  • Status: Archived; code is provided as-is with no expected updates.
  • No community links or active maintenance information is provided.

Licensing & Compatibility

  • The repository does not explicitly state a license.

Limitations & Caveats

  • The project is archived and no longer maintained.
  • The demonstration code has a dependency on the now-archived Nervana neon framework.
  • The OpenAI kernels do not use fp16x2 instructions.
  • Potential accuracy issues with FP16 accumulation in large reductions are noted.

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 1 star in the last 30 days

Explore Similar Projects

Starred by Zhiqiang Xie (Coauthor of SGLang), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 1 more.

KernelBench by ScalingIntelligence

1.9% · 569 stars
Benchmark for LLMs generating GPU kernels from PyTorch ops
Created 10 months ago · Updated 3 weeks ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).

fastllm by ztxz16

0.4% · 4k stars
High-performance C++ LLM inference library
Created 2 years ago · Updated 1 week ago
Starred by David Cournapeau (Author of scikit-learn), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 5 more.

lectures by gpu-mode

0.8% · 5k stars
Lecture series for GPU-accelerated computing
Created 1 year ago · Updated 4 days ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

0.6% · 20k stars
Fast, memory-efficient attention implementation
Created 3 years ago · Updated 1 day ago