BitBLAS by Microsoft

Library for mixed-precision matrix multiplications, targeting quantized LLM deployment

Created 1 year ago
675 stars

Top 50.0% on SourcePulse

Project Summary

BitBLAS is a GPU-accelerated library designed for efficient mixed-precision matrix multiplications, primarily targeting the deployment of quantized Large Language Models (LLMs). It enables high-performance operations for various low-precision data types, offering significant speedups for LLM inference by optimizing computations like GEMV and GEMM.

How It Works

BitBLAS leverages techniques from the "Ladder" paper (OSDI'24), employing hardware-aware tensor transformations to achieve high performance across diverse mixed-precision matrix multiplication scenarios. It supports auto-tensorization for TensorCore-like instructions and provides a flexible DSL (TIR Script) for customizing DNN operations beyond standard matrix multiplication.
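As a rough intuition for what a hardware-aware layout transformation does (this is a toy sketch, not BitBLAS's actual API or tiling scheme): a row-major matrix is repacked into small fixed-size tiles so that a TensorCore-like instruction can consume each tile as one contiguous block.

```python
def tile_layout(matrix, tile_rows, tile_cols):
    """Repack a row-major 2D list into row-major tiles of size
    tile_rows x tile_cols; elements within a tile become contiguous.
    Illustrative only -- real layouts also handle swizzling and padding."""
    rows, cols = len(matrix), len(matrix[0])
    assert rows % tile_rows == 0 and cols % tile_cols == 0
    tiles = []
    for bi in range(0, rows, tile_rows):
        for bj in range(0, cols, tile_cols):
            tiles.append([matrix[bi + r][bj + c]
                          for r in range(tile_rows)
                          for c in range(tile_cols)])
    return tiles

# A 4x4 matrix split into 2x2 tiles: each tile's four elements land together.
m = [[ 0,  1,  2,  3],
     [ 4,  5,  6,  7],
     [ 8,  9, 10, 11],
     [12, 13, 14, 15]]
print(tile_layout(m, 2, 2))  # first tile is [0, 1, 4, 5]
```

Ladder generalizes this idea: it searches over such transformations per data type and target instruction, which is what lets one codebase cover many mixed-precision combinations.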

Quick Start & Requirements

  • Install: pip install bitblas or pip install git+https://github.com/microsoft/BitBLAS.git
  • Prerequisites: Ubuntu 20.04+, Python >= 3.8, CUDA >= 11.0. Pre-built wheels are available for these configurations; otherwise, building from source is required.
  • Docs: Installation, QuickStart, Python API
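To make the "mixed-precision" part concrete, here is a minimal pure-Python sketch of the quantize/dequantize arithmetic behind an FP16xINT4-style GEMV. BitBLAS fuses this on the GPU with packed 4-bit storage; the function names and per-row scaling below are illustrative assumptions, not the library's API.

```python
def quantize_int4(weights, scale):
    """Symmetric round-to-nearest INT4 quantization: integers in [-8, 7]."""
    return [max(-8, min(7, round(w / scale))) for w in weights]

def dequant_dot(q_weights, scale, activations):
    """Dequantize INT4 weights on the fly and dot them with FP activations,
    mirroring the inner loop of a mixed-precision GEMV (illustrative only)."""
    return sum((q * scale) * a for q, a in zip(q_weights, activations))

w = [0.9, -1.5, 0.2, 0.7]
scale = max(abs(v) for v in w) / 7      # simple per-row absmax scale
qw = quantize_int4(w, scale)            # [4, -7, 1, 3]
x = [1.0, 2.0, 3.0, 4.0]
print(qw, round(dequant_dot(qw, scale, x), 3))
```

The speedup in the real kernels comes from never materializing the dequantized weights in memory: the packed low-bit values are expanded in registers right before the multiply.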

Highlighted Details

  • Supports a wide range of mixed-precision data types, including FP16xINT4, INT8xINT4, FP8, and INT2/INT1.
  • Achieves up to 8x speedup over cuBLAS for INT2xINT8 GEMV/GEMM on A100 GPUs.
  • Offers seamless integration with popular LLM frameworks like PyTorch, AutoGPTQ, and vLLM.
  • Provides a flexible DSL (TIR Script) for custom DNN operation implementation.

Maintenance & Community

The project is actively developed by Microsoft, with recent updates including support for INT4xINT4 matmul, Flash Attention Ops, and performance improvements for contiguous batching. The Ladder paper was presented at OSDI'24.

Licensing & Compatibility

The project is licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Pre-built wheels are currently restricted to Ubuntu 20.04+ and CUDA >= 11.0. Users on different platforms or with different CUDA versions will need to build from source. The support matrix is continuously expanding, and specific data type combinations may require custom implementation via the DSL.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 1
  • Star History: 15 stars in the last 30 days
