blocksparse by openai

TensorFlow ops/GPU kernels for block-sparse matrix multiplication and convolution

created 7 years ago
1,043 stars

Top 36.7% on sourcepulse

Project Summary

This package provides efficient TensorFlow GPU kernels for block-sparse matrix multiplication and convolution, targeting researchers and engineers working with large neural networks where sparsity can significantly improve performance. It offers custom ops for sparse operations, aiming to accelerate training and inference by optimizing memory access and computation on NVIDIA GPUs.

How It Works

The core of the package leverages custom CUDA kernels to implement block-sparse matrix multiplication (BlocksparseMatMul) and convolution (BlocksparseConv). It operates by dividing matrices and filters into blocks, processing only the non-zero blocks to reduce computation and memory bandwidth. The kernels are optimized for specific GPU architectures (Maxwell, Pascal, Volta) and support different sparsity patterns and feature axis layouts, enabling faster execution compared to dense operations or standard sparse formats.
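The skip-the-zero-blocks idea can be illustrated with a small NumPy reference (a conceptual sketch only, not the package's CUDA implementation; the function name `block_sparse_matmul` and the `layout`/`blocks` representation are hypothetical):

```python
import numpy as np

def block_sparse_matmul(x, blocks, layout, block_size):
    """Compute x @ W where W is block-sparse.

    x:      (batch, in_features) dense input
    blocks: dict mapping (row_block, col_block) -> (block_size, block_size) array
    layout: (in_features//block_size, out_features//block_size) 0/1 mask
    """
    batch = x.shape[0]
    n_in_blocks, n_out_blocks = layout.shape
    y = np.zeros((batch, n_out_blocks * block_size), dtype=x.dtype)
    for i in range(n_in_blocks):          # input-feature block row
        for j in range(n_out_blocks):     # output-feature block column
            if layout[i, j]:              # zero blocks are skipped entirely
                xi = x[:, i * block_size:(i + 1) * block_size]
                y[:, j * block_size:(j + 1) * block_size] += xi @ blocks[(i, j)]
    return y

# Check against a dense matmul whose zero blocks are materialized explicitly.
rng = np.random.default_rng(0)
block_size, n_in, n_out = 4, 3, 2
layout = rng.integers(0, 2, size=(n_in, n_out))
blocks = {(i, j): rng.standard_normal((block_size, block_size))
          for i in range(n_in) for j in range(n_out) if layout[i, j]}

W = np.zeros((n_in * block_size, n_out * block_size))
for (i, j), b in blocks.items():
    W[i * block_size:(i + 1) * block_size,
      j * block_size:(j + 1) * block_size] = b

x = rng.standard_normal((5, n_in * block_size))
assert np.allclose(block_sparse_matmul(x, blocks, layout, block_size), x @ W)
```

The payoff is that work and memory traffic scale with the number of non-zero blocks rather than with the full matrix size; the package's CUDA kernels apply the same principle with architecture-specific tiling.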

Quick Start & Requirements

  • Install via pip: pip install blocksparse
  • Prerequisites: NVIDIA GPU (Maxwell or newer recommended), Linux (Ubuntu 16.04 tested), CUDA 8, Python 3.5+, TensorFlow 1.4.0+ (with GPU support).
  • CUDA 9/Volta requires updating build targets and recompiling TensorFlow from source.
  • See OpenAI blog post for more details.

Highlighted Details

  • Optimized CUDA kernels for block-sparse matrix multiplication and convolution.
  • Supports various GPU architectures (Kepler, Maxwell, Pascal, Volta) with performance notes.
  • Includes custom ops for layer normalization, batch normalization, and element-wise operations.
  • Offers utilities for weight normalization and gradient aggregation (group_param_grads).

Maintenance & Community

  • Project status is listed as "Active", though the last commit was about two years ago; breaking changes may occur.
  • Developed by OpenAI.

Licensing & Compatibility

  • License: MIT.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • Requires specific NVIDIA GPU hardware and is pinned to older CUDA/TensorFlow versions (CUDA 8, TensorFlow 1.x).
  • BlocksparseMatMul kernels have different feature_axis support depending on the implementation (ASM vs. CudaC).
  • Some features are experimental (e.g., SparseProj, integrated ReLU in layer_norm).
Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 90 days

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Jaret Burkett (founder of Ostris), and 1 more.
