kernl by ELS-RD

PyTorch transformer inference engine for GPU speedup

Created 3 years ago
1,582 stars

Top 26.5% on SourcePulse

Project Summary

Kernl is an open-source Python library that accelerates PyTorch transformer inference on GPUs, often by several times. It targets researchers and engineers working with large language models who need higher inference speed and lower latency, and it positions itself as a more hackable alternative to traditional inference engines.

How It Works

Kernl leverages OpenAI Triton, a Python-based language for writing GPU kernels, to rewrite critical operations such as attention, linear layers, and layernorm. Writing these operations in Triton enables operator fusion, which reduces memory-bandwidth bottlenecks by keeping intermediate results on-chip instead of writing them back to GPU memory. Kernl also uses CUDA graphs for zero-overhead replay of captured inference, and TorchDynamo to handle dynamic model behavior by tracing models and recompiling optimized computation graphs.
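To make the fusion idea concrete, here is a minimal Triton sketch (a hypothetical example, not one of kernl's kernels) that fuses an elementwise add with a ReLU so the intermediate sum stays in registers rather than round-tripping through GPU memory:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        z = x + y  # intermediate kept in registers, never written to DRAM
        tl.store(out_ptr + offsets, tl.maximum(z, 0.0), mask=mask)  # fused activation

    def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        n = x.numel()
        grid = (triton.cdiv(n, 1024),)
        fused_add_relu_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
        return out

An unfused PyTorch equivalent, torch.relu(x + y), launches two kernels and materializes x + y in GPU memory; the fused version does the same work in one pass.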

Quick Start & Requirements

  • Install via pip: pip install 'git+https://github.com/ELS-RD/kernl'
  • Requires PyTorch, Python >= 3.9, CUDA, and an NVIDIA Ampere-generation GPU.
  • A Docker image is available for easier setup.
  • See the Examples for end-to-end use cases; a minimal usage sketch follows this list.
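A minimal usage sketch, assuming the optimize_model entry point shown in the project's README (verify against the current repository):

    import torch
    from transformers import AutoModel, AutoTokenizer
    from kernl.model_optimization import optimize_model  # entry point per kernl's README

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased").eval().cuda()

    optimize_model(model)  # swaps eligible ops for Triton kernels, enables CUDA graphs

    inputs = tokenizer("kernl speeds up transformer inference", return_tensors="pt").to("cuda")
    with torch.inference_mode(), torch.cuda.amp.autocast():
        outputs = model(**inputs)  # first call triggers warmup/compilation; later calls run fast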

Highlighted Details

  • Reports speedups of several times on transformer models such as Llama v2, T5, and Whisper.
  • Kernels are written in OpenAI Triton, with each kernel kept under 200 lines of code for ease of modification.
  • Optimizes models through kernel fusion and by replacing PyTorch operations with custom Triton kernels.
  • Includes extensive benchmarking tools and conventions for performance analysis; a generic timing sketch follows this list.
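Kernl's own benchmarks are more elaborate, but as a hypothetical stand-in, CUDA events give a simple before/after latency comparison (time_inference and its parameters are illustrative, not part of kernl):

    import torch

    def time_inference(fn, warmup: int = 10, iters: int = 100) -> float:
        """Return mean latency of fn() in milliseconds, measured with CUDA events."""
        for _ in range(warmup):  # warmup absorbs compilation and clock ramp-up
            fn()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            fn()
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters

Calling time_inference(lambda: model(**inputs)) before and after optimize_model yields a rough speedup ratio.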

Maintenance & Community

  • Developed by ELS-RD.
  • Contribution guide and Code of Conduct are available.

Licensing & Compatibility

  • License not explicitly stated in the README.

Limitations & Caveats

  • Requires specific hardware (Ampere GPU) and CUDA installation.
  • Benchmarks can take a considerable amount of time to run.
  • The project is built on newer technologies, such as Triton and TorchDynamo, which are themselves still evolving.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 5 stars in the last 30 days

Explore Similar Projects

Starred by Yaowei Zheng (Author of LLaMA-Factory), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 1 more.

VeOmni by ByteDance-Seed

3.4% · 1k stars
Framework for scaling multimodal model training across accelerators
Created 5 months ago
Updated 3 weeks ago
Starred by Luca Soldaini (Research Scientist at Ai2), Edward Sun (Research Scientist at Meta Superintelligence Lab), and 4 more.

parallelformers by tunib-ai

0% · 790 stars
Toolkit for easy model parallelization
Created 4 years ago
Updated 2 years ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 15 more.

ThunderKittens by HazyResearch

0.6% · 3k stars
CUDA kernel framework for fast deep learning primitives
Created 1 year ago
Updated 2 days ago
Starred by David Cournapeau (Author of scikit-learn), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 5 more.

lectures by gpu-mode

0.8% · 5k stars
Lecture series for GPU-accelerated computing
Created 1 year ago
Updated 4 days ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

0.6% · 20k stars
Fast, memory-efficient attention implementation
Created 3 years ago
Updated 1 day ago
Starred by Peter Norvig (Author of "Artificial Intelligence: A Modern Approach"; Research Director at Google), Alexey Milovidov (Cofounder of Clickhouse), and 29 more.

llm.c by karpathy

0.2% · 28k stars
LLM training in pure C/CUDA, no PyTorch needed
Created 1 year ago
Updated 2 months ago