T-MAC by microsoft

Kernel library for low-bit LLM inference on CPUs using lookup tables

Created 1 year ago
856 stars

Top 41.8% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

T-MAC is a kernel library designed to accelerate low-bit Large Language Model (LLM) inference on CPUs and NPUs. It addresses the computational bottleneck of mixed-precision matrix multiplication (mpGEMM) in quantized LLMs by using lookup tables (LUTs) instead of dequantization, enabling significant speedups and reduced power consumption for edge devices.

How It Works

T-MAC takes a bit-serial, LUT-based approach to mpGEMM. Low-bit weights (1-4 bits) are decomposed into bit planes and grouped along the reduction axis; all possible partial sums of the corresponding activations are precomputed and stored in lookup tables. Matrix multiplication then reduces to fast table lookups combined with shift-and-accumulate operations, bypassing dequantization and conventional fused multiply-add instructions. Because each additional weight bit adds only one more round of lookups, computation and latency scale linearly with bit width, unlike dequantization-based methods, whose cost is largely flat across bit widths.
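
To make the mechanism concrete, here is a minimal NumPy sketch of a LUT-based mixed-precision GEMV under simplifying assumptions: unsigned bit-plane weights with no scales or zero points, and a group size of 4. The name lut_mpgemv is illustrative, not T-MAC's API; the real kernels pack tables into registers and use SIMD table-lookup instructions.

    import numpy as np

    def lut_mpgemv(w_planes, x, g=4):
        # w_planes: (b, M, K) array of 0/1 bit planes of unsigned b-bit
        #           weights, least-significant plane first.
        # x:        (K,) float activation vector.
        # g:        group size along K; one LUT entry per 2**g bit pattern.
        b, M, K = w_planes.shape
        assert K % g == 0
        y = np.zeros(M)
        for k0 in range(0, K, g):
            xs = x[k0:k0 + g]
            # Precompute partial sums for every possible g-bit weight pattern.
            lut = np.zeros(2 ** g)
            for pat in range(2 ** g):
                lut[pat] = sum(((pat >> j) & 1) * xs[j] for j in range(g))
            # Every bit plane indexes the same table; planes are combined by
            # shift-and-accumulate, so cost grows linearly with bit width b.
            for i in range(b):
                idx = np.zeros(M, dtype=np.int64)
                for j in range(g):
                    idx |= w_planes[i, :, k0 + j].astype(np.int64) << j
                y += lut[idx] * (1 << i)
        return y

Each extra weight bit only adds one more pass of lookups over the same table, which is the source of the linear scaling noted above, whereas a dequantization-based kernel reconstructs every weight and runs ordinary multiply-adds regardless of bit width.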

Quick Start & Requirements

  • Installation: Clone the repository and install via pip (pip install -e . -v) within a virtual environment. Building TVM from source is required, which can take time.
  • Prerequisites: Python 3.8+ and CMake >= 3.22 on all platforms; zstd and libomp on macOS; the corresponding build tools and libraries on Ubuntu/Windows. Native ARM64 build tools are recommended for Windows on ARM64.
  • Usage: Integration with llama.cpp is provided via an all-in-one script or by building llama.cpp with T-MAC support.
  • Documentation: Android Cross Compilation Guidance

Highlighted Details

  • Achieves a 4-5x speedup over llama.cpp for BitNet-3B on devices such as the Surface Laptop 7.
  • Demonstrates better performance and energy efficiency than the NPU on the Snapdragon X Elite.
  • Offers comparable 2-bit mpGEMM performance to CUDA GPUs on Jetson AGX Orin with significantly lower power consumption.
  • Supports 1/2/4-bit quantized models in GPTQ format for various LLM architectures.
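
As an illustration of the bit-plane layout assumed in the sketch above (real GPTQ checkpoints also carry per-group scales and zero points, omitted here; to_bit_planes is a hypothetical helper, not part of T-MAC), the result can be checked against a plain matmul, reusing lut_mpgemv from earlier:

    import numpy as np

    def to_bit_planes(w_q, b):
        # Split an (M, K) matrix of unsigned b-bit integers into (b, M, K)
        # binary planes, least-significant bit first.
        return np.stack([(w_q >> i) & 1 for i in range(b)])

    rng = np.random.default_rng(0)
    w_q = rng.integers(0, 4, size=(8, 16))   # unsigned 2-bit weights
    x = rng.standard_normal(16)
    assert np.allclose(lut_mpgemv(to_bit_planes(w_q, 2), x), w_q @ x)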

Maintenance & Community

The project is actively developed by Microsoft. Recent updates include integration into llama.cpp, support for additional models (e.g., Qwen2), and performance improvements. The accompanying paper has been accepted at EuroSys 2025.

Licensing & Compatibility

The repository is licensed under the MIT License, permitting commercial use and closed-source linking.

Limitations & Caveats

Performance gains on older x86 platforms may be limited by low memory bandwidth; ARM devices or the Surface Book 3 are recommended for evaluation. Some models may not be supported by the provided conversion scripts.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 20 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Ying Sheng (coauthor of SGLang).

fastllm by ztxz16

Top 0.4% · 4k stars
High-performance C++ LLM inference library
Created 2 years ago · Updated 1 week ago