Library for mixed-precision matrix multiplications, targeting quantized LLM deployment
BitBLAS is a GPU-accelerated library for efficient mixed-precision matrix multiplication, primarily targeting the deployment of quantized Large Language Models (LLMs). It provides high-performance kernels for low-precision data types such as INT4 and INT8, delivering significant speedups for LLM inference by optimizing core operations like GEMV and GEMM.
How It Works
BitBLAS leverages techniques from the "Ladder" paper (OSDI'24), employing hardware-aware tensor transformations to achieve high performance across diverse mixed-precision matrix multiplication scenarios. It supports auto-tensorization for TensorCore-like instructions and provides a flexible DSL (TIR Script) for customizing DNN operations beyond standard matrix multiplication.
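As a hedged illustration, the sketch below builds a single FP16xINT4 operator through the MatmulConfig/Matmul interface documented in the project's README; exact field names may differ between releases.

```python
import bitblas

# FP16 activations x INT4 weights, accumulating and emitting FP16.
# M=1 makes this a GEMV-shaped problem, the common case for
# token-by-token LLM decoding.
config = bitblas.MatmulConfig(
    M=1,                   # rows of the activation matrix
    N=1024,                # output features
    K=1024,                # input features / reduction dimension
    A_dtype="float16",     # activation dtype
    W_dtype="int4",        # quantized weight dtype
    accum_dtype="float16",
    out_dtype="float16",
    layout="nt",           # A row-major, W stored transposed as (N, K)
    with_bias=False,
)

# Constructing the operator triggers BitBLAS's hardware-aware code
# generation and tuning for the current GPU.
matmul = bitblas.Matmul(config=config)
```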
Quick Start & Requirements
pip install bitblas
or, for the latest development version: pip install git+https://github.com/microsoft/BitBLAS.git
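Once installed, a quick smoke test can exercise the operator end to end. This is a sketch following the library's examples, assuming the transform_weight packing step; INT4 weights are carried in an int8 tensor before packing.

```python
import bitblas
import torch

# Same FP16 x INT4 configuration as in the sketch above.
config = bitblas.MatmulConfig(
    M=1, N=1024, K=1024,
    A_dtype="float16", W_dtype="int4",
    accum_dtype="float16", out_dtype="float16",
    layout="nt",
)
matmul = bitblas.Matmul(config=config)

activation = torch.rand((1, 1024), dtype=torch.float16).cuda()
# INT4 values held in an int8 tensor, range [-8, 7], before packing.
weight = torch.randint(-8, 8, (1024, 1024), dtype=torch.int8).cuda()

packed = matmul.transform_weight(weight)  # pack into the tuned layout
output = matmul(activation, packed)       # -> (1, 1024) float16
print(output.shape, output.dtype)
```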
Highlighted Details
Maintenance & Community
The project is actively developed by Microsoft; recent updates include support for INT4xINT4 matmul, Flash Attention ops, and performance improvements for contiguous batching.
Licensing & Compatibility
The project is licensed under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
Pre-built wheels currently require Ubuntu 20.04+ and CUDA 11.0 or newer; users on other platforms or CUDA versions must build from source. The support matrix is still expanding, and some data-type combinations may require a custom implementation via the DSL.