openai-gemm by openai

GEMM kernels for single/half precision

Created 9 years ago
384 stars

Top 74.4% on SourcePulse

View on GitHub
Project Summary

This repository provides optimized CUDA implementations of general matrix multiplication (GEMM) for single-precision (FP32) and half-precision (FP16) floating-point numbers. It targets researchers and engineers in deep learning and high-performance computing who need faster matrix operations, particularly for small minibatch sizes and FP16 computation; the kernels aim to outperform standard libraries such as cuBLAS in those scenarios.

How It Works

The project implements custom CUDA kernels for GEMM operations. The primary advantage stems from specialized kernel designs that are more efficient for smaller matrix dimensions and FP16 data types compared to general-purpose libraries like cuBLAS. This optimization is achieved through careful memory access patterns, thread block configurations, and instruction selection tailored to the target hardware architecture.
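
As a rough, generic illustration of the underlying technique (not the project's actual kernels), the sketch below shows a shared-memory-tiled FP32 GEMM kernel computing C = alpha*A*B + beta*C. The fixed 32x32 tile, row-major layouts, and the absence of register blocking, double buffering, and vectorized loads are simplifying assumptions; the openai-gemm kernels are hand-tuned well beyond this level.

    // Minimal shared-memory-tiled SGEMM sketch (illustrative only):
    // C = alpha * A * B + beta * C, with A (M x K), B (K x N), C (M x N), all row-major.
    // Launch with block = dim3(TILE, TILE), grid = dim3(ceil(N/TILE), ceil(M/TILE)).
    #define TILE 32

    __global__ void sgemm_tiled(int M, int N, int K,
                                float alpha, const float *A, const float *B,
                                float beta, float *C)
    {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;   // row of C owned by this thread
        int col = blockIdx.x * TILE + threadIdx.x;   // column of C owned by this thread
        float acc = 0.0f;

        for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
            // Cooperatively stage one tile of A and one tile of B in shared memory.
            int a_col = t * TILE + threadIdx.x;
            int b_row = t * TILE + threadIdx.y;
            As[threadIdx.y][threadIdx.x] = (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
            Bs[threadIdx.y][threadIdx.x] = (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
            __syncthreads();

            // Multiply the staged tiles; each thread accumulates one element of C.
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }

        if (row < M && col < N)
            C[row * N + col] = alpha * acc + beta * C[row * N + col];
    }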

Quick Start & Requirements

  • Install: Clone the repository and run ./benchmark.py or ./test.py.
  • Prerequisites: Requires the Nervana neon framework for demonstration code.
  • Hardware: Primarily targets NVIDIA GPUs (benchmarks shown on Pascal TITAN X and P100).

Highlighted Details

  • Offers significant speedups over cuBLAS for small minibatch sizes and FP16 operations.
  • Benchmarks show 10-17x FP16 speedups for certain small matrix dimensions.
  • Notes that the OpenAI kernels do not use fp16x2 instructions, and that cuBLAS's FP16 path is less efficient for small dimensions.
  • Mentions potential accuracy concerns when FP16 accumulation is used for large reductions (illustrated by the sketch after this list).
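
To make the accumulation caveat concrete, here is a generic device-side sketch (not openai-gemm code; the function names are hypothetical): summing many FP16 products directly in an FP16 accumulator rounds at every step, so reductions over a large K dimension are usually accumulated in FP32 even when inputs and outputs are FP16.

    // Generic illustration of the FP16 accumulation caveat (not project code).
    // __hadd/__hmul require a GPU with native FP16 arithmetic (compute capability 5.3+).
    #include <cuda_fp16.h>

    // Accumulate entirely in FP16: every add rounds to ~11 mantissa bits,
    // so error grows with the reduction length K.
    __device__ __half dot_fp16_accumulate_fp16(const __half *a, const __half *b, int K)
    {
        __half acc = __float2half(0.0f);
        for (int k = 0; k < K; ++k)
            acc = __hadd(acc, __hmul(a[k], b[k]));
        return acc;
    }

    // Accumulate in FP32 and round once at the end: much closer to the FP32 result
    // for large reductions, which is why FP16 GEMMs often keep an FP32 accumulator.
    __device__ __half dot_fp16_accumulate_fp32(const __half *a, const __half *b, int K)
    {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += __half2float(a[k]) * __half2float(b[k]);
        return __float2half(acc);
    }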

Maintenance & Community

  • Status: Archived; code is provided as-is with no expected updates.
  • No community links or active maintenance information is provided.

Licensing & Compatibility

  • The repository does not explicitly state a license.

Limitations & Caveats

  • The project is archived and no longer maintained.
  • The demonstration code has a dependency on the now-archived Nervana neon framework.
  • The OpenAI kernels do not use fp16x2 instructions.
  • Potential accuracy issues with FP16 accumulation in large reductions are noted.

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 1 star in the last 30 days

Explore Similar Projects

Starred by Zhiqiang Xie (Coauthor of SGLang), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 1 more.

KernelBench by ScalingIntelligence

1.9% · 569 stars
Benchmark for LLMs generating GPU kernels from PyTorch ops
Created 10 months ago · Updated 3 weeks ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).

fastllm by ztxz16

0.4% · 4k stars
High-performance C++ LLM inference library
Created 2 years ago · Updated 1 week ago
Starred by David Cournapeau (Author of scikit-learn), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 5 more.

lectures by gpu-mode

0.8% · 5k stars
Lecture series for GPU-accelerated computing
Created 1 year ago · Updated 4 days ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

0.6% · 20k stars
Fast, memory-efficient attention implementation
Created 3 years ago · Updated 1 day ago