FP16xINT4 kernel for fast LLM inference
Marlin is an optimized FP16xINT4 inference kernel for Large Language Models (LLMs), aimed at researchers and engineers who need high-throughput inference. It achieves near-ideal 4x speedups over FP16 at batch sizes up to 16-32, significantly outperforming prior work, which degrades rapidly beyond batch sizes of 1-2.
How It Works
Marlin employs a sophisticated strategy to maximize GPU utilization by minimizing memory bottlenecks:
- Activations are fetched primarily from L2 cache and reused in registers, while weights are loaded asynchronously with an eviction policy that avoids polluting L2.
- Shared-memory loads are double buffered so they overlap with computation and global loads.
- Dequantization and tensor core instructions are carefully ordered to saturate both GPU pipelines.
- Weights and scales are pre-shuffled offline for optimal access patterns, enabling dequantization directly into tensor core formats.
- Multiple warps per threadblock provide compute parallelism and latency hiding, loads use the maximum vector length, and layout transformations give conflict-free shared memory access and efficient global reduction.
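To make the dequantization step concrete, here is a minimal PyTorch sketch of grouped, symmetric INT4 quantization with FP16 scales, the kind of format such a kernel consumes on the fly. The group size of 128, the symmetric scheme, and the function names are illustrative assumptions; Marlin's actual offline pre-shuffling of weights and scales into tensor-core-friendly layouts is omitted here.

```python
# Illustrative sketch only -- not Marlin's actual packing code.
import torch

def quantize_int4(w: torch.Tensor, group_size: int = 128):
    """Symmetric per-group 4-bit quantization of an [in_features, out_features] FP16 weight."""
    out_features = w.shape[1]
    g = w.float().reshape(-1, group_size, out_features)   # [groups, group_size, out]
    scale = g.abs().amax(dim=1, keepdim=True) / 7.0       # map each group's max magnitude to code 7
    q = torch.clamp(torch.round(g / scale), -8, 7).to(torch.int8)  # int4 codes: [-8, 7]
    return q.reshape(-1, out_features), scale.squeeze(1).half()

def dequantize_int4(q: torch.Tensor, scale: torch.Tensor, group_size: int = 128):
    """The transform the kernel applies on the fly: INT4 codes -> FP16 via per-group scales."""
    out_features = q.shape[1]
    g = q.float().reshape(-1, group_size, out_features)
    return (g * scale.float().unsqueeze(1)).reshape(-1, out_features).half()

w = torch.randn(4096, 4096, dtype=torch.half)   # FP16 weight matrix
q, s = quantize_int4(w)                         # 4-bit codes + FP16 group scales
w_hat = dequantize_int4(q, s)                   # FP16 reconstruction
print((w - w_hat).abs().mean().item())          # small quantization error
```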
Quick Start & Requirements
Install with `pip install .` in the repository root. Running the included examples (such as the GPTQ example) additionally requires `transformers`, `datasets`, and `sentencepiece`.
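Before building, it can help to confirm the environment matches what the kernel expects. The sketch below assumes an Ampere-class GPU (compute capability 8.0+) and FP16 tensors; the capability threshold is inferred from the Hopper caveat further down rather than from stated requirements, so treat it as an assumption and consult the repository README.

```python
# Hedged environment check before `pip install .`; the 8.0 (Ampere) threshold is an assumption.
import torch

assert torch.cuda.is_available(), "Marlin is a CUDA kernel; an NVIDIA GPU is required."

major, minor = torch.cuda.get_device_capability()
print(f"{torch.cuda.get_device_name()}: compute capability {major}.{minor}")
if (major, minor) < (8, 0):
    print("Warning: pre-Ampere GPU detected; the kernel targets Ampere-class hardware.")

# The kernel consumes FP16 activations (with INT4 weights), so inputs are torch.half on the GPU.
x = torch.randn(16, 4096, dtype=torch.half, device="cuda")
print(x.dtype, x.device)
```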
Highlighted Details
Maintenance & Community
The project is associated with IST-DASLab. Citation details are provided for academic use.
Licensing & Compatibility
The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
Marlin is not yet optimized for NVIDIA Hopper GPUs. The provided GPTQ example is primarily for demonstration and validation, not flexible compression. ECC memory can reduce achievable bandwidth by 10-15%.