T-MAC by microsoft

Kernel library for low-bit LLM inference on CPUs using lookup tables

created 1 year ago
836 stars

Top 43.5% on sourcepulse

View on GitHub
Project Summary

T-MAC is a kernel library designed to accelerate low-bit Large Language Model (LLM) inference on CPUs and NPUs. It addresses the computational bottleneck of mixed-precision matrix multiplication (mpGEMM) in quantized LLMs by using lookup tables (LUTs) instead of dequantization, enabling significant speedups and reduced power consumption for edge devices.

How It Works

T-MAC employs a novel LUT-based approach to mpGEMM. It groups low-bit weights (1-4 bits), precomputes the partial sums of the activations for every possible weight bit pattern in a group, and stores them in LUTs. Inference then reduces to fast table lookups plus shift-and-accumulate operations, bypassing dequantization and conventional fused multiply-add instructions. As a result, compute and latency scale linearly with weight bit width, so halving the bits roughly halves the work, unlike dequantization-based methods, whose cost stays flat as precision drops.
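To make this concrete, here is a toy NumPy sketch of LUT-based low-bit GEMV. It is illustrative only, not T-MAC's optimized kernels; the group size, bit width, dimensions, and variable names are hypothetical choices for the demo.

```python
# Toy LUT-based mpGEMV sketch (NumPy). Illustrative only; T-MAC's real
# kernels are hand-tuned CPU code. Group size, bit width, and names here
# are hypothetical choices for the demo.
import numpy as np

G = 4        # weights per lookup group
BITS = 2     # weight precision (T-MAC supports 1-4 bits)
K = 16       # inner dimension, a multiple of G

rng = np.random.default_rng(0)
x = rng.standard_normal(K).astype(np.float32)   # fp activations
w = rng.integers(0, 2**BITS, size=K)            # unsigned 2-bit weights

# Offline: decompose weights into 1-bit planes, w = sum_b plane_b << b.
planes = [(w >> b) & 1 for b in range(BITS)]

# Online, once per activation vector: for every group of G activations,
# precompute the partial sum for all 2**G possible 1-bit weight patterns.
patterns = np.array([[(p >> i) & 1 for i in range(G)]
                     for p in range(2**G)], dtype=np.float32)  # (2**G, G)
luts = x.reshape(-1, G) @ patterns.T                           # (K//G, 2**G)

# Inference: replace multiply-accumulate with a table lookup per group,
# then shift-accumulate across the bit planes.
acc = 0.0
for b, plane in enumerate(planes):
    bits = plane.reshape(-1, G)
    idx = (bits * (1 << np.arange(G))).sum(axis=1)  # pack G bits -> LUT index
    acc += luts[np.arange(K // G), idx].sum() * (1 << b)

print(acc)            # matches the plain dot product (up to float rounding)
print(float(x @ w))
```

The per-bit-plane loop is where the linear scaling shows up: 2-bit weights take two lookup passes where 4-bit weights would take four, while the multiply work is paid once when building the tables.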

Quick Start & Requirements

  • Installation: Clone the repository and install via pip (pip install -e . -v) within a virtual environment; building TVM from source is part of the install and can take time. A consolidated command sketch follows this list.
  • Prerequisites: Python 3.8+ and CMake >= 3.22; zstd and libomp on macOS, or the equivalent build tools and libraries on Ubuntu/Windows. Native ARM64 tools are recommended for Windows ARM64.
  • Usage: Integration with llama.cpp is provided via an all-in-one script or by building llama.cpp with T-MAC support.
  • Documentation: Android Cross Compilation Guidance
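Putting the documented steps together, a minimal install sketch (the clone URL is the Microsoft repository; the --recursive flag and virtual-environment name are assumptions about the default layout, so check the upstream README):

```bash
# Sketch of the documented install flow; flags marked below are assumptions.
git clone --recursive https://github.com/microsoft/T-MAC   # --recursive assumed for vendored submodules
cd T-MAC
python -m venv .venv && source .venv/bin/activate          # any Python 3.8+ environment works
pip install -e . -v                                        # builds TVM from source; expect a long build
```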

Highlighted Details

  • Achieves 4-5x speedup over llama.cpp on devices like Surface Laptop 7 for BitNet-3B.
  • Demonstrates superior performance and energy efficiency compared to NPUs on Snapdragon X Elite.
  • Offers comparable 2-bit mpGEMM performance to CUDA GPUs on Jetson AGX Orin with significantly lower power consumption.
  • Supports 1/2/4-bit quantized models in GPTQ format for various LLM architectures.

Maintenance & Community

The project is actively developed by Microsoft. Updates include integration into llama.cpp, support for more models (e.g., Qwen2), and improved performance. The paper has been accepted by EuroSys 2025.

Licensing & Compatibility

The repository is licensed under the MIT License, permitting commercial use and closed-source linking.

Limitations & Caveats

Performance on older x86 platforms may vary due to low memory bandwidth; ARM devices or Surface Book 3 are recommended for evaluation. Some models may not be supported by the provided conversion scripts.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
Star History
78 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (Founder of Ostris), and 1 more.

nunchaku by nunchaku-tech

Top 2.1% · 3k stars
High-performance 4-bit diffusion model inference engine
created 8 months ago · updated 14 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Nat Friedman (Former CEO of GitHub), and 32 more.

llama.cpp by ggml-org

Top 0.4% · 84k stars
C/C++ library for local LLM inference
created 2 years ago · updated 14 hours ago