Kernel library for low-bit LLM inference on CPUs using lookup tables
T-MAC is a kernel library designed to accelerate low-bit Large Language Model (LLM) inference on CPUs and NPUs. It addresses the computational bottleneck of mixed-precision matrix multiplication (mpGEMM) in quantized LLMs by using lookup tables (LUTs) instead of dequantization, enabling significant speedups and reduced power consumption for edge devices.
How It Works
T-MAC employs a novel LUT-based approach for mpGEMM. It groups low-bit weights (1-4 bits), precomputes all possible partial sums for each group, and stores them in LUTs. Matrix multiplication then reduces to fast table lookups combined with shift and accumulate operations, bypassing dequantization and the fused multiply-add instructions used by traditional kernels. As a result, FLOPs and latency scale down linearly with bit precision, unlike dequantization-based methods.
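The grouping, table-precomputation, and shift-and-accumulate steps above can be sketched in NumPy. This is a minimal illustration of the LUT idea, not T-MAC's actual kernels (which use bit-serial int8 tables and SIMD table-lookup instructions such as ARM TBL); the function names and group size here are illustrative.

```python
import numpy as np

def lut_matvec_1bit(W_bits, x, g=4):
    """Multiply a 1-bit weight matrix (entries in {0,1}) by vector x
    using one precomputed lookup table per group of g activations."""
    n_out, n_in = W_bits.shape
    assert n_in % g == 0
    y = np.zeros(n_out, dtype=x.dtype)
    for j0 in range(0, n_in, g):
        xg = x[j0:j0 + g]
        # Precompute all 2^g possible partial sums for this activation group.
        table = np.zeros(1 << g, dtype=x.dtype)
        for idx in range(1 << g):
            table[idx] = sum(xg[b] for b in range(g) if (idx >> b) & 1)
        # Each output row now costs one table lookup per group
        # instead of g multiply-adds.
        for i in range(n_out):
            bits = W_bits[i, j0:j0 + g]
            idx = sum(int(bits[b]) << b for b in range(g))
            y[i] += table[idx]
    return y

def lut_matvec_nbit(W_int, x, bits=2, g=4):
    """n-bit unsigned weights decomposed into n one-bit planes,
    combined by shift-and-accumulate (the tables are reused per plane)."""
    y = np.zeros(W_int.shape[0], dtype=x.dtype)
    for b in range(bits):
        plane = (W_int >> b) & 1
        y += lut_matvec_1bit(plane, x, g) * (1 << b)
    return y

# Sanity check against a plain matmul.
rng = np.random.default_rng(0)
W = rng.integers(0, 4, size=(8, 16))          # 2-bit weights
x = rng.standard_normal(16).astype(np.float32)
assert np.allclose(lut_matvec_nbit(W, x, bits=2), W @ x, atol=1e-4)
```

Note how the per-group table is shared across all output rows, which is what amortizes the precomputation cost; the linear scaling with bit width falls out of the per-plane loop in `lut_matvec_nbit`.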
Quick Start & Requirements
Install with pip install -e . -v within a virtual environment. Building TVM from source is required, which can take time. Dependencies include zstd and libomp (macOS), or specific build tools and libraries for Ubuntu/Windows. Native ARM64 tools are recommended for Windows ARM64.
Highlighted Details
Maintenance & Community
The project is actively developed by Microsoft. Updates include integration into llama.cpp, support for more models (e.g., Qwen2), and improved performance. The paper has been accepted by EuroSys 2025.
Licensing & Compatibility
The repository is licensed under the MIT License, permitting commercial use and closed-source linking.
Limitations & Caveats
Performance on older x86 platforms may vary due to low memory bandwidth; ARM devices or a Surface Book 3 are recommended for evaluation. Some models may not be supported by the provided conversion scripts.