FP16xINT4 kernel for fast LLM inference
Marlin is an optimized FP16xINT4 inference kernel for Large Language Models (LLMs), aimed at researchers and engineers who need high-throughput inference. It achieves near-ideal 4x speedups over FP16 at batch sizes up to 16-32, significantly outperforming prior work, which degrades rapidly beyond batch sizes of 1-2.
How It Works
Marlin employs a sophisticated strategy to maximize GPU utilization by minimizing memory bottlenecks:
- Activations are fetched primarily from L2 cache and reused in registers, while weights are loaded asynchronously with an eviction policy that avoids polluting L2.
- Shared-memory loads are double buffered so they overlap with computation and global loads.
- Dequantization and tensor core instructions are carefully ordered to saturate both GPU pipelines.
- Weights and scales are pre-shuffled offline for optimal access patterns, enabling dequantization directly into tensor core formats.
- Multiple warps per threadblock provide compute parallelism and latency hiding, loads use the maximum vector length, and layout transformations give conflict-free shared memory access and efficient global reduction.
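To make the dequantization step concrete, here is a minimal PyTorch sketch of grouped, symmetric INT4 quantization with FP16 scales, the kind of format such a kernel consumes on the fly. The group size of 128, the symmetric scheme, and the function names are illustrative assumptions; Marlin's actual offline pre-shuffling of weights and scales into tensor-core-friendly layouts is omitted here.

```python
# Illustrative sketch only -- not Marlin's actual packing code.
import torch

def quantize_int4(w: torch.Tensor, group_size: int = 128):
    """Symmetric per-group 4-bit quantization of an [in_features, out_features] FP16 weight."""
    out_features = w.shape[1]
    g = w.float().reshape(-1, group_size, out_features)   # [groups, group_size, out]
    scale = g.abs().amax(dim=1, keepdim=True) / 7.0       # map each group's max magnitude to code 7
    q = torch.clamp(torch.round(g / scale), -8, 7).to(torch.int8)  # int4 codes: [-8, 7]
    return q.reshape(-1, out_features), scale.squeeze(1).half()

def dequantize_int4(q: torch.Tensor, scale: torch.Tensor, group_size: int = 128):
    """The transform the kernel applies on the fly: INT4 codes -> FP16 via per-group scales."""
    out_features = q.shape[1]
    g = q.float().reshape(-1, group_size, out_features)
    return (g * scale.float().unsqueeze(1)).reshape(-1, out_features).half()

w = torch.randn(4096, 4096, dtype=torch.half)   # FP16 weight matrix
q, s = quantize_int4(w)                         # 4-bit codes + FP16 group scales
w_hat = dequantize_int4(q, s)                   # FP16 reconstruction
print((w - w_hat).abs().mean().item())          # small quantization error
```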
Quick Start & Requirements
Install with `pip install .` in the repository root. Running the included examples (such as the GPTQ example) additionally requires `transformers`, `datasets`, and `sentencepiece`.
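Before building, it can help to confirm the environment matches what the kernel expects. The sketch below assumes an Ampere-class GPU (compute capability 8.0+) and FP16 tensors; the capability threshold is inferred from the Hopper caveat further down rather than from stated requirements, so treat it as an assumption and consult the repository README.

```python
# Hedged environment check before `pip install .`; the 8.0 (Ampere) threshold is an assumption.
import torch

assert torch.cuda.is_available(), "Marlin is a CUDA kernel; an NVIDIA GPU is required."

major, minor = torch.cuda.get_device_capability()
print(f"{torch.cuda.get_device_name()}: compute capability {major}.{minor}")
if (major, minor) < (8, 0):
    print("Warning: pre-Ampere GPU detected; the kernel targets Ampere-class hardware.")

# The kernel consumes FP16 activations (with INT4 weights), so inputs are torch.half on the GPU.
x = torch.randn(16, 4096, dtype=torch.half, device="cuda")
print(x.dtype, x.device)
```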
Highlighted Details
Maintenance & Community
The project is associated with IST-DASLab. Citation details are provided for academic use.
Licensing & Compatibility
The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
Marlin is not yet optimized for NVIDIA Hopper GPUs. The provided GPTQ example is primarily for demonstration and validation, not flexible compression. ECC memory can reduce achievable bandwidth by 10-15%.