kernl by ELS-RD

PyTorch transformer inference engine for GPU speedup

Created 3 years ago
1,582 stars

Top 26.5% on SourcePulse

Project Summary

Kernl is an open-source Python library that accelerates PyTorch transformer inference on GPUs, often by several times. It targets researchers and engineers working with large language models who need higher inference speed and lower latency, and it positions itself as a more hackable alternative to traditional inference engines.

How It Works

Kernl leverages OpenAI Triton, a Python-based language for writing GPU kernels, to rewrite critical operations such as attention, linear layers, and layernorm. Writing these operations in Triton enables operator fusion, which reduces memory-bandwidth bottlenecks by keeping intermediate results on-chip instead of writing them back to GPU memory. Kernl also uses CUDA graphs for zero-overhead replay of captured inference, and TorchDynamo to handle dynamic model behavior by tracing models and recompiling optimized computation graphs.
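To make the fusion idea concrete, here is a minimal Triton sketch (a hypothetical example, not one of kernl's kernels) that fuses an elementwise add with a ReLU so the intermediate sum stays in registers rather than round-tripping through GPU memory:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        z = x + y  # intermediate kept in registers, never written to DRAM
        tl.store(out_ptr + offsets, tl.maximum(z, 0.0), mask=mask)  # fused activation

    def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        n = x.numel()
        grid = (triton.cdiv(n, 1024),)
        fused_add_relu_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
        return out

An unfused PyTorch equivalent, torch.relu(x + y), launches two kernels and materializes x + y in GPU memory; the fused version does the same work in one pass.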

Quick Start & Requirements

  • Install via pip: pip install 'git+https://github.com/ELS-RD/kernl'
  • Requires PyTorch, Python >= 3.9, CUDA, and an NVIDIA Ampere-generation GPU.
  • A Docker image is available for easier setup.
  • See the Examples for end-to-end use cases; a minimal usage sketch follows this list.
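A minimal usage sketch, assuming the optimize_model entry point shown in the project's README (verify against the current repository):

    import torch
    from transformers import AutoModel, AutoTokenizer
    from kernl.model_optimization import optimize_model  # entry point per kernl's README

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased").eval().cuda()

    optimize_model(model)  # swaps eligible ops for Triton kernels, enables CUDA graphs

    inputs = tokenizer("kernl speeds up transformer inference", return_tensors="pt").to("cuda")
    with torch.inference_mode(), torch.cuda.amp.autocast():
        outputs = model(**inputs)  # first call triggers warmup/compilation; later calls run fast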

Highlighted Details

  • Reports speedups of several times on transformer models such as Llama v2, T5, and Whisper.
  • Kernels are written in OpenAI Triton, with each kernel kept under 200 lines of code for ease of modification.
  • Optimizes models through kernel fusion and by replacing PyTorch operations with custom Triton kernels.
  • Includes extensive benchmarking tools and conventions for performance analysis; a generic timing sketch follows this list.
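Kernl's own benchmarks are more elaborate, but as a hypothetical stand-in, CUDA events give a simple before/after latency comparison (time_inference and its parameters are illustrative, not part of kernl):

    import torch

    def time_inference(fn, warmup: int = 10, iters: int = 100) -> float:
        """Return mean latency of fn() in milliseconds, measured with CUDA events."""
        for _ in range(warmup):  # warmup absorbs compilation and clock ramp-up
            fn()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            fn()
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters

Calling time_inference(lambda: model(**inputs)) before and after optimize_model yields a rough speedup ratio.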

Maintenance & Community

  • Developed by ELS-RD.
  • Contribution guide and Code of Conduct are available.

Licensing & Compatibility

  • License not explicitly stated in the README.

Limitations & Caveats

  • Requires specific hardware (Ampere GPU) and CUDA installation.
  • Benchmarks can take a considerable amount of time to run.
  • The project is built on newer technologies, such as Triton and TorchDynamo, which are themselves still evolving.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 5 stars in the last 30 days

Explore Similar Projects

Starred by Yaowei Zheng (Author of LLaMA-Factory), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 1 more.

VeOmni by ByteDance-Seed

3.4% · 1k stars
Framework for scaling multimodal model training across accelerators
Created 5 months ago
Updated 3 weeks ago
Starred by Luca Soldaini (Research Scientist at Ai2), Edward Sun (Research Scientist at Meta Superintelligence Lab), and 4 more.

parallelformers by tunib-ai

0% · 790 stars
Toolkit for easy model parallelization
Created 4 years ago
Updated 2 years ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 15 more.

ThunderKittens by HazyResearch

0.6% · 3k stars
CUDA kernel framework for fast deep learning primitives
Created 1 year ago
Updated 2 days ago
Starred by David Cournapeau (Author of scikit-learn), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 5 more.

lectures by gpu-mode

0.8% · 5k stars
Lecture series for GPU-accelerated computing
Created 1 year ago
Updated 4 days ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

0.6% · 20k stars
Fast, memory-efficient attention implementation
Created 3 years ago
Updated 1 day ago
Starred by Peter Norvig (Author of "Artificial Intelligence: A Modern Approach"; Research Director at Google), Alexey Milovidov (Cofounder of Clickhouse), and 29 more.

llm.c by karpathy

0.2% · 28k stars
LLM training in pure C/CUDA, no PyTorch needed
Created 1 year ago
Updated 2 months ago