FlagGems by FlagOpen

Operator library for LLM training/inference, implemented in Triton

created 1 year ago
634 stars

Top 53.3% on sourcepulse

Project Summary

FlagGems is a high-performance operator library for large language models, implemented in OpenAI Triton and designed to accelerate LLM training and inference. It targets researchers and engineers working with PyTorch: by hooking into the ATen backend, it boosts performance without requiring any model code modifications.

How It Works

FlagGems provides a suite of kernel functions written in Triton, a language that approaches CUDA-level performance while being markedly easier to read and write. It integrates with PyTorch's ATen backend, so users can switch to the Triton kernels with minimal code changes. An automatic code-generation system produces pointwise and fused operators, reducing the boilerplate of adding new kernels. A LibEntry mechanism manages kernel caching independently, bypassing the standard autotuner runtime to keep cache keys simple and dispatch overhead low.
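
To give a concrete sense of the style, below is a minimal hand-written Triton pointwise kernel of the kind the code generator automates. This is an illustrative sketch only, not code from the FlagGems source; the kernel and wrapper names are hypothetical.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the tail of the last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = out.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out
```

Writing each of the library's many operators by hand in this style would be repetitive, which is what motivates the pointwise code-generation system.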

Quick Start & Requirements

  • Install: pip install flaggems for the pure-Python build, or build with the C++ extensions for improved performance (see the usage sketch after this list).
  • Prerequisites: PyTorch. Tested on NVIDIA GPUs with float16, float32, and bfloat16 precision.
  • Documentation: GetStart
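
A minimal sketch of the drop-in integration, assuming the flag_gems.enable() and flag_gems.use_gems() entry points shown in the project README; confirm the exact API against the GetStart guide.

```python
import torch
import flag_gems

# Globally route supported ATen operators to FlagGems' Triton kernels.
flag_gems.enable()

x = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
y = torch.mm(x, x)  # now dispatched to a Triton kernel; no model changes

# Alternatively, scope the replacement to a region of code:
with flag_gems.use_gems():
    z = torch.nn.functional.gelu(x)
```

Because the replacement happens at the ATen dispatch level, existing model code runs unchanged.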

Highlighted Details

  • Supports a wide range of BLAS, pointwise, reduction, tensor, neural network, basic math, distribution, and science operators.
  • Includes fused operators like silu_and_mul and apply_rotary_position_embedding (a reference sketch of silu_and_mul follows this list).
  • Tested with models such as Bert-base-uncased, Llama-2-7b, and Llava-1.5-7b.
  • Offers automatic code generation for custom operators.
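
For reference, silu_and_mul fuses the SiLU-gated multiplication used in LLaMA-style MLP blocks. A plain, unfused PyTorch equivalent is sketched below; the function name and two-tensor signature are illustrative, and the library's actual kernel may expect a single concatenated input.

```python
import torch
import torch.nn.functional as F

def silu_and_mul_reference(gate: torch.Tensor, up: torch.Tensor) -> torch.Tensor:
    # Unfused reference: SiLU(gate) * up. A fused kernel computes this in
    # one pass, avoiding materialization of the intermediate SiLU(gate)
    # tensor and reducing memory traffic for this step.
    return F.silu(gate) * up
```

Fusion pays off here because the operation is memory-bound: the win comes from fewer reads and writes, not fewer FLOPs.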

Maintenance & Community

  • Active development with regular updates (v1.0, v2.0, v2.1).
  • Contact via email: flaggems@baai.ac.cn or by submitting an issue.
  • WeChat group available for community engagement.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

The library primarily targets NVIDIA GPUs, with explicit support for float16, float32, and bfloat16 data types. While it aims for broad operator coverage, specific operator availability should be confirmed against the OperatorList.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 126
  • Issues (30d): 13

Star History

  • 126 stars in the last 90 days
