BitBLAS by microsoft

Library for mixed-precision matrix multiplications, targeting quantized LLM deployment

created 1 year ago
654 stars

Top 52.0% on sourcepulse

Project Summary

BitBLAS is a GPU-accelerated library designed for efficient mixed-precision matrix multiplications, primarily targeting the deployment of quantized Large Language Models (LLMs). It enables high-performance operations for various low-precision data types, offering significant speedups for LLM inference by optimizing computations like GEMV and GEMM.
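To make the mixed-precision idea concrete, the sketch below shows what a W4A16-style GEMV computes: weights stored as unsigned 4-bit integers are dequantized against FP16-like float activations. The scale/zero-point scheme here is an illustrative assumption about low-bit quantization in general, not BitBLAS's actual kernel, which fuses dequantization into tuned GPU code.

```python
# Conceptual W4A16 GEMV: 4-bit quantized weights, float activations.
# Illustrative only; BitBLAS performs this inside optimized GPU kernels.

def dequantize(q, scale, zero_point):
    """Map an unsigned 4-bit value (0..15) back to a float weight."""
    return (q - zero_point) * scale

def gemv_w4a16(q_weights, activations, scale, zero_point=8):
    """Compute y = W @ x with W stored row-major as 4-bit unsigned ints."""
    out = []
    for row in q_weights:
        acc = 0.0
        for q, x in zip(row, activations):
            acc += dequantize(q, scale, zero_point) * x
        out.append(acc)
    return out

# Example: a 2x3 weight matrix quantized with scale 0.5, zero point 8.
q_w = [[8, 10, 6],   # dequantized row: [0.0, 1.0, -1.0]
       [12, 8, 9]]   # dequantized row: [2.0, 0.0, 0.5]
x = [1.0, 2.0, 3.0]
print(gemv_w4a16(q_w, x, scale=0.5))  # [-1.0, 3.5]
```

The speedup in real kernels comes from never materializing the dequantized matrix: each 4-bit weight is expanded in registers immediately before its multiply-accumulate.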

How It Works

BitBLAS leverages techniques from the "Ladder" paper (OSDI'24), employing hardware-aware tensor transformations to achieve high performance across diverse mixed-precision matrix multiplication scenarios. It supports auto-tensorization for TensorCore-like instructions and provides a flexible DSL (TIR Script) for customizing DNN operations beyond standard matrix multiplication.

Quick Start & Requirements

  • Install: pip install bitblas or pip install git+https://github.com/microsoft/BitBLAS.git
  • Prerequisites: Ubuntu 20.04+, Python >= 3.8, CUDA >= 11.0. Pre-built wheels are available for these configurations; otherwise, building from source is required.
  • Docs: Installation, QuickStart, Python API

Highlighted Details

  • Supports a wide range of mixed-precision data types, including FP16xINT4, INT8xINT4, FP8, and INT2/INT1.
  • Achieves up to 8x speedup over cuBLAS for INT2xINT8 GEMV/GEMM on A100 GPUs.
  • Offers seamless integration with popular LLM frameworks like PyTorch, AutoGPTQ, and vLLM.
  • Provides a flexible DSL (TIR Script) for custom DNN operation implementation.
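A concrete view of how low-bit data types like INT4 are stored: a common technique in quantized-GEMM libraries (sketched here as a general illustration, not a claim about BitBLAS internals) is packing two unsigned 4-bit values into a single byte.

```python
# Packing two unsigned 4-bit weights into one byte (general technique,
# not BitBLAS-specific): the high nibble holds one value, the low
# nibble the other, halving memory traffic versus one-byte-per-weight.

def pack_int4_pair(lo, hi):
    """Pack two unsigned 4-bit values (0..15) into one byte."""
    assert 0 <= lo < 16 and 0 <= hi < 16
    return (hi << 4) | lo

def unpack_int4_pair(byte):
    """Recover the two 4-bit values from a packed byte."""
    return byte & 0x0F, (byte >> 4) & 0x0F

packed = pack_int4_pair(3, 12)
print(packed)                    # 195
print(unpack_int4_pair(packed))  # (3, 12)
```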

Maintenance & Community

The project is actively developed by Microsoft, with recent updates including support for INT4xINT4 matmul, Flash Attention Ops, and performance improvements for contiguous batching. The Ladder paper was presented at OSDI'24.

Licensing & Compatibility

The project is licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Pre-built wheels are currently restricted to Ubuntu 20.04+ and CUDA >= 11.0. Users on different platforms or with different CUDA versions will need to build from source. The support matrix is continuously expanding, and specific data type combinations may require custom implementation via the DSL.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 2
  • Issues (30d): 0
  • Star History: 56 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Jaret Burkett (founder of Ostris), and 1 more.

nunchaku by nunchaku-tech (2.1%, 3k stars)

High-performance 4-bit diffusion model inference engine. Created 8 months ago; updated 17 hours ago.