Liger-Kernel by linkedin

Triton kernels for efficient LLM training

Created 1 year ago
5,662 stars

Top 9.1% on SourcePulse

View on GitHub
Project Summary

Liger Kernel provides a suite of optimized Triton kernels designed to significantly enhance the efficiency of Large Language Model (LLM) training. Targeting researchers and engineers working with LLMs, it offers substantial improvements in training throughput and memory usage, enabling larger models and longer context lengths.

How It Works

Liger Kernel leverages Triton's capabilities for low-level GPU programming to fuse common LLM operations like RMSNorm, RoPE, SwiGLU, and various loss functions. This fusion, combined with techniques like in-place computation and chunking, reduces memory bandwidth requirements and computational overhead. The kernels are designed for exact computation, ensuring no loss of accuracy compared to standard implementations.
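
To make the fusion idea concrete, the following is a minimal, illustrative Triton RMSNorm kernel (a sketch, not Liger's actual implementation): the mean-of-squares reduction, normalization, and weight scaling run in a single kernel, so each activation row makes only one round trip through global memory.

    # Illustrative fused RMSNorm in Triton (not Liger's kernel): one program per row;
    # the reduction, normalization, and scaling are fused so the row is loaded once.
    import torch
    import triton
    import triton.language as tl


    @triton.jit
    def rmsnorm_kernel(x_ptr, w_ptr, out_ptr, n_cols, eps, BLOCK_SIZE: tl.constexpr):
        row = tl.program_id(0)
        cols = tl.arange(0, BLOCK_SIZE)
        mask = cols < n_cols

        x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0).to(tl.float32)
        w = tl.load(w_ptr + cols, mask=mask, other=0.0).to(tl.float32)

        rms = tl.sqrt(tl.sum(x * x, axis=0) / n_cols + eps)  # fused reduction
        y = (x / rms) * w                                     # fused normalize + scale

        tl.store(out_ptr + row * n_cols + cols, y, mask=mask)


    def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
        # x: (n_rows, n_cols) on the GPU; one Triton program handles one row.
        x = x.contiguous()
        out = torch.empty_like(x)
        n_rows, n_cols = x.shape
        BLOCK_SIZE = triton.next_power_of_2(n_cols)
        rmsnorm_kernel[(n_rows,)](x, weight, out, n_cols, eps, BLOCK_SIZE=BLOCK_SIZE)
        return out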

Quick Start & Requirements

  • Installation: pip install liger-kernel (stable) or pip install liger-kernel-nightly (nightly); install from source with git clone followed by pip install -e . (see the quick-start sketch after this list).
  • Prerequisites: CUDA with torch >= 2.1.2 and Triton >= 2.3.0 for NVIDIA GPUs, or ROCm with torch >= 2.5.0 and Triton >= 3.0.0 for AMD GPUs. transformers (>= 4.x) is required for the patching APIs.
  • Setup: Minimal dependencies, primarily Torch and Triton.
  • Resources: Supports multi-GPU setups (FSDP, DeepSpeed, DDP).
  • Documentation: Getting Started, Examples, High-level APIs, Low-level APIs.
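
A minimal quick-start sketch, assuming the AutoLigerKernelForCausalLM wrapper exported by liger_kernel.transformers; the checkpoint name and dtype are illustrative choices, not requirements.

    # Load a Hugging Face causal LM and let Liger apply its kernels where a patch exists.
    import torch
    from liger_kernel.transformers import AutoLigerKernelForCausalLM  # assumed export

    model = AutoLigerKernelForCausalLM.from_pretrained(
        "meta-llama/Meta-Llama-3-8B",   # illustrative checkpoint
        torch_dtype=torch.bfloat16,
    )
    # Train as usual (Trainer, FSDP, DeepSpeed, DDP); the fused kernels are used transparently.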

Highlighted Details

  • Up to 20% throughput increase and 60% memory reduction for LLM training layers.
  • Up to 80% memory savings for post-training alignment and distillation tasks (DPO, ORPO, CPO, etc.).
  • Full AMD ROCm support alongside NVIDIA CUDA.
  • One-line patching for Hugging Face models or direct composition into custom models (see the patching sketch after this list).
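
A sketch of the one-line patching path, assuming the apply_liger_kernel_to_llama entry point in liger_kernel.transformers; other supported architectures follow the same apply_liger_kernel_to_* pattern.

    # Patch Hugging Face's Llama modeling code in place with Liger's fused kernels,
    # then instantiate the model as usual so the patched classes are picked up.
    from transformers import AutoModelForCausalLM
    from liger_kernel.transformers import apply_liger_kernel_to_llama  # assumed export

    apply_liger_kernel_to_llama()  # swap in fused RMSNorm, RoPE, SwiGLU, and loss kernels

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")  # illustrative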

Maintenance & Community

Actively developed by LinkedIn, with significant community contributions (50+ PRs, 10+ contributors). NVIDIA, AMD, and Intel provide GPU resources. The project integrates with Hugging Face, Lightning AI, Axolotl, and Llama-Factory, and a Discord channel is available for discussion.

Licensing & Compatibility

The project is licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

While generally stable, some kernels are marked as experimental. Model architectures not covered by the high-level patching APIs may require manual integration via the low-level APIs, as sketched below.
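
A minimal sketch of that manual path, assuming LigerRMSNorm is exported from liger_kernel.transformers; the surrounding block is illustrative, not a Liger API.

    # Compose a low-level Liger module directly inside a custom block.
    import torch
    import torch.nn as nn
    from liger_kernel.transformers import LigerRMSNorm  # assumed export


    class CustomBlock(nn.Module):
        def __init__(self, hidden_size: int):
            super().__init__()
            self.norm = LigerRMSNorm(hidden_size)            # fused Triton RMSNorm
            self.proj = nn.Linear(hidden_size, hidden_size, bias=False)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.proj(self.norm(x))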

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 18
  • Issues (30d): 14
  • Star History: 135 stars in the last 30 days

Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), Yaowei Zheng (Author of LLaMA-Factory), and 4 more.

Explore Similar Projects

ml-cross-entropy by apple

0.4%
520 stars
PyTorch module for memory-efficient cross-entropy in LLMs
Created 10 months ago
Updated 23 hours ago
Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Elvis Saravia (Founder of DAIR.AI), and 2 more.

YaFSDP by yandex

0.1%
975 stars
Sharded data parallelism framework for transformer-like neural networks
Created 1 year ago
Updated 3 months ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

0.6%
20k stars
Fast, memory-efficient attention implementation
Created 3 years ago
Updated 1 day ago