FlagGems by FlagOpen

Operator library for LLM training/inference, implemented in Triton

created 1 year ago
634 stars

Top 53.3% on sourcepulse

Project Summary

FlagGems is a high-performance operator library for large language models, implemented in OpenAI Triton and designed to accelerate LLM training and inference. It targets researchers and engineers working with PyTorch: by hooking into the ATen backend, it boosts performance without requiring any model code modifications.

How It Works

FlagGems provides a suite of kernel functions written in Triton, a language that approaches CUDA-level performance while being markedly easier to read and write. It integrates with PyTorch's ATen backend, so users can switch to the Triton kernels with minimal code changes. An automatic code-generation system produces pointwise and fused operators, reducing the boilerplate of adding new kernels. A LibEntry mechanism manages kernel caching independently, bypassing the standard autotuner runtime to keep cache keys simple and dispatch overhead low.
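
To give a concrete sense of the style, below is a minimal hand-written Triton pointwise kernel of the kind the code generator automates. This is an illustrative sketch only, not code from the FlagGems source; the kernel and wrapper names are hypothetical.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the tail of the last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = out.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out
```

Writing each of the library's many operators by hand in this style would be repetitive, which is what motivates the pointwise code-generation system.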

Quick Start & Requirements

  • Install: pip install flaggems for the pure-Python build, or build with the C++ extensions for improved performance (see the usage sketch after this list).
  • Prerequisites: PyTorch. Tested on NVIDIA GPUs with float16, float32, and bfloat16 precision.
  • Documentation: GetStart
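
A minimal sketch of the drop-in integration, assuming the flag_gems.enable() and flag_gems.use_gems() entry points shown in the project README; confirm the exact API against the GetStart guide.

```python
import torch
import flag_gems

# Globally route supported ATen operators to FlagGems' Triton kernels.
flag_gems.enable()

x = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
y = torch.mm(x, x)  # now dispatched to a Triton kernel; no model changes

# Alternatively, scope the replacement to a region of code:
with flag_gems.use_gems():
    z = torch.nn.functional.gelu(x)
```

Because the replacement happens at the ATen dispatch level, existing model code runs unchanged.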

Highlighted Details

  • Supports a wide range of BLAS, pointwise, reduction, tensor, neural network, basic math, distribution, and science operators.
  • Includes fused operators like silu_and_mul and apply_rotary_position_embedding (a reference sketch of silu_and_mul follows this list).
  • Tested with models such as Bert-base-uncased, Llama-2-7b, and Llava-1.5-7b.
  • Offers automatic code generation for custom operators.
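
For reference, silu_and_mul fuses the SiLU-gated multiplication used in LLaMA-style MLP blocks. A plain, unfused PyTorch equivalent is sketched below; the function name and two-tensor signature are illustrative, and the library's actual kernel may expect a single concatenated input.

```python
import torch
import torch.nn.functional as F

def silu_and_mul_reference(gate: torch.Tensor, up: torch.Tensor) -> torch.Tensor:
    # Unfused reference: SiLU(gate) * up. A fused kernel computes this in
    # one pass, avoiding materialization of the intermediate SiLU(gate)
    # tensor and reducing memory traffic for this step.
    return F.silu(gate) * up
```

Fusion pays off here because the operation is memory-bound: the win comes from fewer reads and writes, not fewer FLOPs.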

Maintenance & Community

  • Active development with regular updates (v1.0, v2.0, v2.1).
  • Contact via email: flaggems@baai.ac.cn or by submitting an issue.
  • WeChat group available for community engagement.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

The library primarily targets NVIDIA GPUs, with explicit support for float16, float32, and bfloat16 data types. While it aims for broad operator coverage, specific operator availability should be confirmed against the OperatorList.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 126
  • Issues (30d): 13

Star History

  • 126 stars in the last 90 days
