marlin by IST-DASLab

FP16xINT4 kernel for fast LLM inference

created 1 year ago
870 stars

Top 42.2% on sourcepulse

Project Summary

Marlin is an optimized FP16xINT4 matmul kernel for Large Language Model (LLM) inference, targeting researchers and engineers who need high-throughput serving. Since 4-bit weights take a quarter of the memory traffic of FP16 weights, a memory-bound matmul can run at most 4x faster; Marlin sustains this near-ideal 4x speedup over FP16 at batch sizes up to 16-32, whereas prior kernels degrade rapidly beyond batch size 1-2.

How It Works

Marlin maximizes GPU utilization by keeping the kernel out of memory bottlenecks:

  • Activations are fetched primarily from L2 cache and reused in registers, while weights are loaded asynchronously with an eviction-policy hint that keeps them from polluting L2.
  • Shared memory loads are double-buffered so they overlap with computation and with further global loads (sketched below).
  • Dequantization and tensor core instructions are carefully ordered to keep both GPU pipelines saturated.
  • Weights and scales are reshuffled offline into a layout that permits maximum-width loads and dequantization directly into tensor core operand formats (see the second sketch below).
  • Multiple warps per threadblock provide compute throughput and latency hiding, and layout transformations yield conflict-free shared memory access plus an efficient global reduction.
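
To make the double-buffering idea concrete, here is a minimal CUDA sketch, not Marlin's actual code: it prefetches the packed weights for tile t+1 with cp.async-style primitives while tile t is being consumed. The kernel name, the 256-thread block size, and the tile layout are assumptions, and the dequantization/tensor core work is elided.

```cuda
#include <cuda_pipeline.h>  // __pipeline_memcpy_async / commit / wait_prior

// Hypothetical double-buffered GEMM skeleton: 256 threads per block,
// one int4 (16 bytes) of packed 4-bit weights per thread per tile.
__global__ void double_buffered_sketch(const int4* __restrict__ b_quant,
                                       int k_tiles) {
  __shared__ int4 smem_b[2][256];  // two stages of the weight tile buffer

  // Prefetch tile 0 so the first compute step never waits on DRAM.
  // (The real kernel also attaches an L2 eviction hint via inline PTX.)
  __pipeline_memcpy_async(&smem_b[0][threadIdx.x],
                          &b_quant[threadIdx.x], sizeof(int4));
  __pipeline_commit();

  for (int t = 0; t < k_tiles; ++t) {
    int cur = t & 1, nxt = cur ^ 1;
    if (t + 1 < k_tiles) {
      // Issue the copy for tile t+1 before consuming tile t.
      __pipeline_memcpy_async(&smem_b[nxt][threadIdx.x],
                              &b_quant[(t + 1) * 256 + threadIdx.x],
                              sizeof(int4));
      __pipeline_commit();
      __pipeline_wait_prior(1);  // tile t landed; t+1 may still be in flight
    } else {
      __pipeline_wait_prior(0);  // last tile: drain all outstanding copies
    }
    __syncthreads();
    // ... dequantize smem_b[cur] and issue tensor core mma instructions ...
    __syncthreads();  // all warps finish the stage before it is overwritten
  }
}
```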

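The "direct dequantization into tensor core formats" leans on FP16 bit tricks rather than integer-to-float conversion instructions. The sketch below shows one well-known trick of this family (Marlin's actual code uses lop3-based variants tuned to its pre-shuffled layout); the helper name and the symmetric zero point of 8 are assumptions:

```cuda
#include <cstdint>
#include <cuda_fp16.h>

// Hypothetical helper: dequantize two 4-bit values (in bits 0-3 and 16-19
// of `packed`) into a half2, assuming a symmetric zero point of 8.
__device__ __forceinline__ half2 dequant2(uint32_t packed) {
  // Move each nibble into the mantissa of an fp16 lane whose exponent bits
  // (0x6400) encode 1024, so each lane holds the value 1024 + q exactly.
  uint32_t h2 = (packed & 0x000F000Fu) | 0x64006400u;
  // Subtract 1032 = 1024 + 8 per lane to recover q - 8 in [-8, 7].
  const half2 bias = __half2half2(__ushort_as_half(0x6408u));
  return __hsub2(*reinterpret_cast<half2*>(&h2), bias);
}
```

A per-group FP16 scale multiplication would follow; since the whole path is cheap bitwise and FP16 ALU work, it overlaps well with tensor core instructions.
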
Quick Start & Requirements

  • Install via pip install . in the repository root.
  • Requires CUDA >= 11.8 and an NVIDIA GPU with compute capability >= 8.0 (Ampere or Ada).
  • PyTorch >= 2.0.0 and NumPy are required.
  • The quantization scripts additionally require transformers, datasets, and sentencepiece.
  • Documentation and examples are available within the repository.

Highlighted Details

  • Achieves near-ideal 4x speedup over FP16 at batch sizes up to 16-32, whereas prior kernels degrade rapidly beyond batch size 1-2.
  • Sustains strong performance on real-world matrix shapes and across GPUs thanks to its "striped" partitioning scheme (sketched after this list).
  • Maintains its performance even with the GPU locked to its base clock, making it robust to clock throttling under sustained load.
  • Ships an improved GPTQ algorithm for producing Marlin-compatible 4-bit models, along with evaluation scripts.
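
For intuition on the striped scheme, the skeleton below, which is not Marlin's actual code, deals out contiguous stripes of GEMM tiles to threadblocks in column-major order along the reduction dimension, so stripes may span output-tile boundaries on matrices of any shape. All names are assumptions and the global reduction is elided:

```cuda
// Hypothetical skeleton of striped work partitioning over a
// k_tiles x n_tiles grid of GEMM tiles, launched with one block per SM.
__global__ void striped_partition_sketch(int k_tiles, int n_tiles) {
  int total = k_tiles * n_tiles;                      // all tiles to process
  int stripe = (total + gridDim.x - 1) / gridDim.x;   // tiles per threadblock
  int first = blockIdx.x * stripe;                    // start of this stripe
  int last = min(first + stripe, total);

  for (int it = first; it < last; ++it) {
    int kt = it % k_tiles;  // position along the reduction dimension
    int nt = it / k_tiles;  // which column of output tiles
    // ... accumulate the partial product of tile (kt, nt) into the
    // running result for output column-tile nt ...
    // A stripe may start or end mid-column; those partial results are
    // combined across threadblocks by a lightweight global reduction.
  }
}
```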

Maintenance & Community

The project is associated with IST-DASLab. Citation details are provided for academic use.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Marlin is not yet optimized for NVIDIA Hopper GPUs. The bundled GPTQ example is intended for demonstration and validation rather than as a flexible compression tool. ECC memory can reduce achievable memory bandwidth by 10-15%.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 60 stars in the last 90 days
