blocksparse by openai

TensorFlow ops/GPU kernels for block-sparse matrix multiplication and convolution

Created 7 years ago
1,049 stars

Top 35.9% on SourcePulse

Project Summary

This package provides efficient TensorFlow GPU kernels for block-sparse matrix multiplication and convolution, targeting researchers and engineers working with large neural networks where sparsity can significantly improve performance. Its custom ops aim to accelerate training and inference by reducing computation and memory traffic on NVIDIA GPUs.

How It Works

The core of the package leverages custom CUDA kernels to implement block-sparse matrix multiplication (BlocksparseMatMul) and convolution (BlocksparseConv). It operates by dividing matrices and filters into blocks, processing only the non-zero blocks to reduce computation and memory bandwidth. The kernels are optimized for specific GPU architectures (Maxwell, Pascal, Volta) and support different sparsity patterns and feature axis layouts, enabling faster execution compared to dense operations or standard sparse formats.
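
The minimal example below, adapted from the usage pattern in the project's README (TensorFlow 1.x API), builds a block-sparse layer from a random sparsity pattern; the sizes are placeholders:

    from blocksparse.matmul import BlocksparseMatMul
    import tensorflow as tf
    import numpy as np

    hidden_size = 4096
    block_size = 32
    minibatch_size = 64

    # Random binary pattern over (hidden/block)^2 blocks; 1 marks a live block.
    sparsity = np.random.randint(2, size=(hidden_size // block_size,
                                          hidden_size // block_size))

    # The op object compiles kernels specialized to this sparsity pattern.
    bsmm = BlocksparseMatMul(sparsity, block_size=block_size)

    x = tf.placeholder(tf.float32, shape=[None, hidden_size])

    # Weights are stored densely per live block; bsmm.w_shape gives the shape.
    w = tf.get_variable("w", bsmm.w_shape, dtype=tf.float32)

    # y = x @ w, computed only over the live blocks.
    y = bsmm(x, w)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        out = sess.run(y, feed_dict={x: np.ones((minibatch_size, hidden_size),
                                                dtype=np.float32)})
        print(out.shape)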

Quick Start & Requirements

  • Install via pip: pip install blocksparse (a short import check follows this list).
  • Prerequisites: NVIDIA GPU (Maxwell or newer recommended), Linux (Ubuntu 16.04 tested), CUDA 8, Python 3.5+, TensorFlow 1.4.0+ (with GPU support).
  • CUDA 9/Volta requires updating build targets and recompiling TensorFlow from source.
  • See OpenAI blog post for more details.
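
After installing, a quick check like the following (a hypothetical smoke test, not from the project docs) confirms the package imports and TensorFlow can see a GPU:

    # Hypothetical smoke test; not part of the project's documentation.
    import tensorflow as tf
    from blocksparse.matmul import BlocksparseMatMul  # import check only

    # TF 1.x API; the custom kernels need a CUDA-capable GPU at runtime.
    print("GPU available:", tf.test.is_gpu_available())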

Highlighted Details

  • Optimized CUDA kernels for block-sparse matrix multiplication and convolution.
  • Supports various GPU architectures (Kepler, Maxwell, Pascal, Volta) with performance notes.
  • Includes custom ops for layer normalization, batch normalization, and element-wise operations (a layer-norm sketch follows this list).
  • Offers utilities for weight normalization and gradient aggregation (group_param_grads).
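
The sketch below shows how the fused layer-norm op might be invoked; the module path blocksparse.norms, the argument names, and the relu flag are assumptions inferred from the feature list above, not verified against the package:

    # Assumed API; module path, argument names, and defaults are unverified.
    import tensorflow as tf
    from blocksparse.norms import layer_norm

    hidden_size = 4096
    x = tf.placeholder(tf.float32, shape=[None, hidden_size])
    g = tf.get_variable("g", [hidden_size], initializer=tf.ones_initializer())
    b = tf.get_variable("b", [hidden_size], initializer=tf.zeros_initializer())

    # relu=True would engage the experimental fused ReLU noted in the caveats.
    y = layer_norm(x, g, b, axis=1, relu=True)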

Maintenance & Community

  • The README lists the project status as "Active" with ongoing development and possible breaking changes, though recent commit activity (see Health Check below) suggests development has since stalled.
  • Developed by OpenAI.

Licensing & Compatibility

  • License: MIT.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • Tied to specific NVIDIA GPU generations and to older CUDA (8/9) and TensorFlow 1.x releases for optimal performance.
  • BlocksparseMatMul kernels have different feature_axis support depending on the implementation (ASM vs. CudaC); see the sketch after this list.
  • Some features are experimental (e.g., SparseProj, integrated ReLU in layer_norm).
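
To make the feature_axis caveat concrete: the layout axis is chosen when the op is constructed, and which values work depends on the kernel implementation selected for your GPU. The call below is a hypothetical sketch; the accepted values are assumptions:

    # Hypothetical: feature_axis picks whether features are the leading (0)
    # or trailing (1) tensor axis; support differs between the ASM kernels
    # and the CudaC fallback, so verify against your GPU architecture.
    from blocksparse.matmul import BlocksparseMatMul
    import numpy as np

    blocks = 4096 // 32
    sparsity = np.random.randint(2, size=(blocks, blocks))
    bsmm = BlocksparseMatMul(sparsity, block_size=32, feature_axis=0)
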
Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 5 stars in the last 30 days

Explore Similar Projects

deepsparse by neuralmagic

  • CPU inference runtime for sparse deep learning models
  • 3k stars · Created 4 years ago · Updated 3 months ago
  • Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Casper Hansen (Author of AutoAWQ), and 3 more.

FasterTransformer by NVIDIA

  • Optimized transformer library for inference
  • 6k stars · Created 4 years ago · Updated 1 year ago
  • Starred by Nat Friedman (Former CEO of GitHub), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 15 more.