Library for mixed-precision matrix multiplications, targeting quantized LLM deployment
BitBLAS is a GPU-accelerated library for efficient mixed-precision matrix multiplication, primarily targeting the deployment of quantized Large Language Models (LLMs). It provides high-performance kernels for low-precision data types such as INT4 and INT8, delivering significant speedups for LLM inference by optimizing core operations like GEMV and GEMM.
How It Works
BitBLAS leverages techniques from the "Ladder" paper (OSDI'24), employing hardware-aware tensor transformations to achieve high performance across diverse mixed-precision matrix multiplication scenarios. It supports auto-tensorization for TensorCore-like instructions and provides a flexible DSL (TIR Script) for customizing DNN operations beyond standard matrix multiplication.
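As a hedged illustration, the sketch below builds a single FP16xINT4 operator through the MatmulConfig/Matmul interface documented in the project's README; exact field names may differ between releases.

```python
import bitblas

# FP16 activations x INT4 weights, accumulating and emitting FP16.
# M=1 makes this a GEMV-shaped problem, the common case for
# token-by-token LLM decoding.
config = bitblas.MatmulConfig(
    M=1,                   # rows of the activation matrix
    N=1024,                # output features
    K=1024,                # input features / reduction dimension
    A_dtype="float16",     # activation dtype
    W_dtype="int4",        # quantized weight dtype
    accum_dtype="float16",
    out_dtype="float16",
    layout="nt",           # A row-major, W stored transposed as (N, K)
    with_bias=False,
)

# Constructing the operator triggers BitBLAS's hardware-aware code
# generation and tuning for the current GPU.
matmul = bitblas.Matmul(config=config)
```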
Quick Start & Requirements
pip install bitblas
or, for the latest development version: pip install git+https://github.com/microsoft/BitBLAS.git
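Once installed, a quick smoke test can exercise the operator end to end. This is a sketch following the library's examples, assuming the transform_weight packing step; INT4 weights are carried in an int8 tensor before packing.

```python
import bitblas
import torch

# Same FP16 x INT4 configuration as in the sketch above.
config = bitblas.MatmulConfig(
    M=1, N=1024, K=1024,
    A_dtype="float16", W_dtype="int4",
    accum_dtype="float16", out_dtype="float16",
    layout="nt",
)
matmul = bitblas.Matmul(config=config)

activation = torch.rand((1, 1024), dtype=torch.float16).cuda()
# INT4 values held in an int8 tensor, range [-8, 7], before packing.
weight = torch.randint(-8, 8, (1024, 1024), dtype=torch.int8).cuda()

packed = matmul.transform_weight(weight)  # pack into the tuned layout
output = matmul(activation, packed)       # -> (1, 1024) float16
print(output.shape, output.dtype)
```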
Highlighted Details
Maintenance & Community
The project is actively developed by Microsoft; recent updates include support for INT4xINT4 matmul, Flash Attention ops, and performance improvements for contiguous batching.
Licensing & Compatibility
The project is licensed under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
Pre-built wheels currently require Ubuntu 20.04+ and CUDA 11.0 or newer; users on other platforms or CUDA versions must build from source. The support matrix is still expanding, and some data-type combinations may require a custom implementation via the DSL.