marlin by IST-DASLab

FP16xINT4 kernel for fast LLM inference

Created 1 year ago
898 stars

Top 40.4% on SourcePulse

View on GitHub
Project Summary

Marlin is an optimized FP16xINT4 inference kernel for Large Language Models (LLMs), targeting researchers and engineers who need high-throughput inference. It achieves near-ideal 4x speedups over FP16 at batch sizes up to 16-32, significantly outperforming prior kernels, whose speedups degrade rapidly beyond batch sizes of 1-2.

How It Works

Marlin employs several techniques to maximize GPU utilization by minimizing memory bottlenecks:

  • Activations are fetched primarily from L2 cache and reused in registers, while weights are loaded asynchronously with an eviction policy that avoids polluting L2.
  • Shared-memory loads are double-buffered so they overlap with both computation and global loads (a minimal sketch of this pattern follows after this list).
  • Dequantization and tensor core instructions are carefully ordered to keep both GPU pipelines saturated.
  • Weights and scales are pre-shuffled offline for optimal access patterns, enabling direct dequantization into tensor core operand formats.
  • Multiple warps per threadblock provide compute throughput and latency hiding, loads use the maximum vector length, and layout transformations give conflict-free shared memory access and an efficient global reduction.
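The double-buffering step above can be made concrete with a short standalone kernel. This is a minimal, hedged sketch rather than Marlin's actual implementation: the tile size, thread mapping, and the scalar unpack-and-accumulate "compute" stage are assumptions standing in for Marlin's dequantize-plus-tensor-core sequence, and the PTX-level L2 eviction hints are omitted.

    // Double-buffered shared-memory staging with cp.async, in the spirit of the
    // description above. Illustrative only; tile size, names, and the scalar
    // "compute" stage are assumptions, and Marlin's tensor-core math and L2
    // eviction hints are omitted.
    // Example launch: double_buffered_consume<<<1, 256>>>(w_packed, out, n_tiles);
    #include <cuda_pipeline.h>
    #include <cstdint>

    constexpr int TILE = 1024;  // int32 words of packed INT4 weights per tile (assumed)

    __global__ void double_buffered_consume(const int32_t* __restrict__ w_packed,
                                            float* __restrict__ out,
                                            int num_tiles) {
      __shared__ int32_t buf[2][TILE];  // load into one buffer while computing on the other
      const int tid = threadIdx.x;

      // Prefetch tile 0 into buffer 0 asynchronously.
      for (int i = tid; i < TILE; i += blockDim.x)
        __pipeline_memcpy_async(&buf[0][i], &w_packed[i], sizeof(int32_t));
      __pipeline_commit();

      float acc = 0.0f;
      for (int t = 0; t < num_tiles; ++t) {
        const int cur = t & 1, nxt = cur ^ 1;

        // Kick off the asynchronous copy of the next tile before touching the current one.
        if (t + 1 < num_tiles) {
          const int32_t* src = w_packed + (size_t)(t + 1) * TILE;
          for (int i = tid; i < TILE; i += blockDim.x)
            __pipeline_memcpy_async(&buf[nxt][i], &src[i], sizeof(int32_t));
        }
        __pipeline_commit();

        // Wait only for the copy of the current tile; the next tile's copy stays in flight.
        __pipeline_wait_prior(1);
        __syncthreads();  // make the staged tile visible to all threads before compute

        // Stand-in "compute": unpack eight 4-bit values per word (zero-point 8 assumed)
        // and accumulate, where Marlin would instead dequantize into tensor-core
        // fragments and issue mma instructions.
        for (int i = tid; i < TILE; i += blockDim.x) {
          const int32_t q = buf[cur][i];
          for (int j = 0; j < 8; ++j)
            acc += (float)((q >> (4 * j)) & 0xF) - 8.0f;
        }
        __syncthreads();  // all threads are done with buf[cur] before it is overwritten
      }
      out[blockIdx.x * blockDim.x + tid] = acc;  // one partial result per thread
    }

The key point of the pattern is that __pipeline_wait_prior(1) only waits for the tile about to be consumed, so the copy of the following tile remains in flight while the current one is processed.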

Quick Start & Requirements

  • Install via pip install . in the repository root.
  • Requires CUDA >= 11.8 and an NVIDIA GPU with compute capability >= 8.0 (Ampere/Ada); a standalone check is sketched after this list.
  • PyTorch >= 2.0.0 and NumPy are required.
  • Quantization scripts require transformers, datasets, and sentencepiece.
  • Official documentation and examples are available within the repository.
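To sanity-check the hardware requirement listed above, a small standalone host program can query the runtime and device directly. This is an illustrative sketch, not part of the repository; from Python, torch.version.cuda and torch.cuda.get_device_capability() report the same information.

    // Illustrative environment check for the requirements above (CUDA >= 11.8,
    // compute capability >= 8.0). Not part of the Marlin repository.
    // Build with, e.g.: nvcc check_env.cu -o check_env   (file name is arbitrary)
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
      int runtime = 0;
      cudaRuntimeGetVersion(&runtime);  // encoded as 1000*major + 10*minor, e.g. 11080 for 11.8
      printf("CUDA runtime: %d.%d\n", runtime / 1000, (runtime % 1000) / 10);

      cudaDeviceProp prop;
      if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("No CUDA device found.\n");
        return 1;
      }
      printf("Device 0: %s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);

      const bool ok = runtime >= 11080 && prop.major >= 8;
      printf("%s\n", ok ? "Meets Marlin's stated requirements."
                        : "Does not meet Marlin's stated requirements.");
      return ok ? 0 : 1;
    }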

Highlighted Details

  • Achieves near-ideal 4x speedups at batch sizes up to 16-32, unlike prior kernels, whose speedups degrade after batch sizes of 1-2.
  • Demonstrates strong performance on real-world matrices and various GPUs due to its "striped" partitioning scheme.
  • Maintains optimal performance even with GPU clocks locked to their base values, so results are robust to clock throttling.
  • Includes an improved GPTQ algorithm for generating Marlin-compatible 4-bit models and evaluation scripts.

Maintenance & Community

The project is associated with IST-DASLab. Citation details are provided for academic use.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Marlin is not yet optimized for NVIDIA Hopper GPUs. The provided GPTQ example is primarily for demonstration and validation, not flexible compression. ECC memory can reduce achievable bandwidth by 10-15%.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 17 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Jeremy Howard (Cofounder of fast.ai).

GPTFast by MDK8888

0%
687 stars
HF Transformers accelerator for faster inference
Created 1 year ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).

fastllm by ztxz16

0.4%
4k stars
High-performance C++ LLM inference library
Created 2 years ago
Updated 1 week ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

0.6%
20k stars
Fast, memory-efficient attention implementation
Created 3 years ago
Updated 1 day ago
Starred by Tobi Lutke (Cofounder of Shopify), Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), and 36 more.

unsloth by unslothai

0.6%
46k stars
Finetuning tool for LLMs, targeting speed and memory efficiency
Created 1 year ago
Updated 14 hours ago