openai-gemm by openai

GEMM kernels for single/half precision

created 8 years ago · 382 stars · Top 75.9% on sourcepulse

View on GitHub
Project Summary

This repository provides optimized CUDA implementations of general matrix multiplication (GEMM) for single-precision (FP32) and half-precision (FP16) floating-point numbers. It targets researchers and engineers in deep learning and high-performance computing who need faster matrix operations, particularly at small minibatch sizes and in FP16, where it aims to outperform standard libraries such as cuBLAS.
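
For reference, GEMM computes C = alpha*(A @ B) + beta*C. A minimal NumPy sketch of the operation these kernels accelerate (reference semantics only, not the CUDA implementation):

    import numpy as np

    def gemm_reference(A, B, C, alpha=1.0, beta=0.0):
        # Reference GEMM semantics: C <- alpha * (A @ B) + beta * C.
        return alpha * (A @ B) + beta * C

    # A small-minibatch shape typical of the regime this project targets.
    A = np.random.rand(32, 4096).astype(np.float32)
    B = np.random.rand(4096, 4096).astype(np.float32)
    C = np.zeros((32, 4096), dtype=np.float32)
    print(gemm_reference(A, B, C).shape)  # (32, 4096)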

How It Works

The project implements custom CUDA kernels for GEMM operations. The primary advantage stems from specialized kernel designs that are more efficient for smaller matrix dimensions and FP16 data types compared to general-purpose libraries like cuBLAS. This optimization is achieved through careful memory access patterns, thread block configurations, and instruction selection tailored to the target hardware architecture.
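
At a high level, such kernels gain efficiency through tiling: each thread block computes one small output tile, staging panels of A and B through shared memory and registers so every loaded element is reused many times. A pure-Python illustration of the blocking pattern (conceptual only; the actual kernels are hand-tuned CUDA):

    import numpy as np

    def blocked_matmul(A, B, tile=64):
        # Each (i, j) output tile stands in for one thread block's work;
        # every loaded element is reused ~tile times instead of once.
        M, K = A.shape
        _, N = B.shape
        C = np.zeros((M, N), dtype=A.dtype)
        for i in range(0, M, tile):
            for j in range(0, N, tile):
                acc = np.zeros_like(C[i:i+tile, j:j+tile])
                for k in range(0, K, tile):  # march down the reduction dim
                    acc += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
                C[i:i+tile, j:j+tile] = acc
        return C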

Quick Start & Requirements

  • Install: Clone the repository and run ./benchmark.py or ./test.py to exercise the kernels (a hedged usage sketch follows this list).
  • Prerequisites: The benchmark and test scripts depend on the Nervana neon framework.
  • Hardware: Targets NVIDIA GPUs; published benchmarks were run on a Pascal TITAN X and a P100.
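
Beyond the bundled scripts, a hedged usage sketch: this assumes the repo's openai_gemm module exposes a matmul(A, B, C) entry point over GPU arrays, which is an inference from the benchmark/test scripts rather than a documented API:

    # Hypothetical usage; matmul(A, B, C) with C <- A @ B semantics is
    # an assumption, not a documented interface.
    import numpy as np
    import pycuda.autoinit            # creates a CUDA context
    import pycuda.gpuarray as gpuarray
    from openai_gemm import matmul    # assumed entry point

    A = gpuarray.to_gpu(np.random.rand(32, 4096).astype(np.float16))
    B = gpuarray.to_gpu(np.random.rand(4096, 4096).astype(np.float16))
    C = gpuarray.empty((32, 4096), dtype=np.float16)

    matmul(A, B, C)   # the small-minibatch FP16 case the kernels target
    print(C.get()[:2, :4])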

Highlighted Details

  • Offers significant speedups over cuBLAS at small minibatch sizes and for FP16 operations.
  • Benchmarks show up to 10-17x FP16 speedups for certain small matrix dimensions.
  • The OpenAI kernels do not implement fp16x2 instructions, and cuBLAS's own FP16 implementation is less efficient at small dimensions.
  • Notes potential accuracy loss when FP16 accumulation is used over large reductions (a sketch follows this list).
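
The FP16 accumulation caveat is easy to reproduce on the CPU. A NumPy sketch of a half-precision accumulator stalling during a large reduction:

    import numpy as np

    x = np.ones(20000, dtype=np.float16)

    # Once the running sum reaches 2048, adjacent float16 values are
    # 2.0 apart, so adding 1.0 rounds away and the sum stops growing.
    acc16 = np.float16(0.0)
    for v in x:
        acc16 = np.float16(acc16 + v)

    acc32 = x.astype(np.float32).sum()  # FP32 accumulation is exact here

    print(acc16)  # 2048.0
    print(acc32)  # 20000.0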

Maintenance & Community

  • Status: Archived; code is provided as-is with no expected updates.
  • No community links or active maintenance information is provided.

Licensing & Compatibility

  • The repository does not explicitly state a license.

Limitations & Caveats

  • The project is archived and no longer maintained.
  • The demonstration code depends on the Nervana neon framework, which is itself now archived.
  • OpenAI kernels do not implement fp16x2 instructions.
  • Potential accuracy issues with FP16 accumulation in large reductions are noted.
Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 2 stars in the last 90 days

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Jaret Burkett (founder of Ostris), and 1 more.

Explore Similar Projects

  • nunchaku by nunchaku-tech: High-performance 4-bit diffusion model inference engine. Top 2.1%, 3k stars; created 8 months ago, updated 14 hours ago.