ppl.nn by OpenPPL

Inference engine for efficient AI inference

created 4 years ago
1,346 stars

Top 30.5% on sourcepulse

View on GitHub
Project Summary

PPLNN is a high-performance deep-learning inference engine with a particular focus on Large Language Models (LLMs). It runs a wide range of ONNX models, offers enhanced compatibility with the OpenMMLab ecosystem, and targets developers and researchers who need optimized inference performance.

How It Works

PPLNN (a recursive acronym for "PPLNN is a Primitive Library for Neural Network") executes ONNX models through a library of optimized primitives. For LLM workloads it adds Flash Attention, Split-k Attention, Group-Query Attention, and dynamic batching, and it supports Tensor Parallelism and graph optimizations. INT8 quantization of the KV cache and of weights/activations targets near-FP16 accuracy with a reduced memory footprint and higher throughput.
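
As a rough illustration of the memory-footprint claim, the sketch below compares FP16 and groupwise INT8 KV-cache sizes. All shapes and the group size are illustrative assumptions, not values taken from PPLNN.

    # Back-of-the-envelope KV-cache sizing: FP16 vs. groupwise INT8.
    # All shapes and the group size below are illustrative assumptions.
    layers, kv_heads, head_dim = 32, 32, 128   # LLaMA-7B-like shape
    batch, seq_len = 8, 4096
    group_size = 8                             # assumed quantization group

    elems = 2 * layers * kv_heads * head_dim * batch * seq_len  # K and V
    fp16_bytes = elems * 2                     # 2 bytes per FP16 value
    # INT8 keeps 1 byte per value plus one FP16 scale per group.
    int8_bytes = elems + (elems // group_size) * 2

    print(f"FP16 KV cache: {fp16_bytes / 2**30:.1f} GiB")  # 16.0 GiB
    print(f"INT8 KV cache: {int8_bytes / 2**30:.1f} GiB")  # 10.0 GiB

With these assumed shapes, groupwise INT8 cuts the cache from 16 GiB to about 10 GiB while keeping one scale per small group of values.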

Quick Start & Requirements

  • Install: Clone the repository and build from source using ./build.sh -DPPLNN_USE_X86_64=ON -DPPLNN_ENABLE_PYTHON_API=ON.
  • Prerequisites: build-essential, cmake, git, python3, python3-dev (Debian/Ubuntu) or gcc, gcc-c++, cmake3, make, git, python3, python3-devel (RedHat/CentOS).
  • Demo: PYTHONPATH=./pplnn-build/install/lib python3 ./tools/pplnn.py --use-x86 --onnx-model tests/testdata/conv.onnx; a Python-API sketch of the same flow follows this list.
  • Docs: Building from Source, API Reference
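
Beyond the CLI demo, tools/pplnn.py drives the engine through the pyppl Python bindings. The sketch below follows the same general flow; the module paths and method names (pyppl.nn, EngineFactory.Create, ConvertFromHost, and so on) are assumptions based on that tool and may differ across PPLNN versions, so check tools/pplnn.py in your checkout before relying on them.

    # Minimal sketch of running an ONNX model through PPLNN's Python
    # bindings. API names below are assumptions based on tools/pplnn.py
    # and may differ between versions; error-status checks are omitted.
    import numpy as np
    from pyppl import nn as pplnn

    # Create an x86 engine (a CUDA engine is created analogously).
    engine = pplnn.x86.EngineFactory.Create(pplnn.x86.EngineOptions())

    # Build a runtime for the ONNX model on top of that engine.
    builder = pplnn.onnx.RuntimeBuilderFactory.Create()
    builder.LoadModelFromFile("tests/testdata/conv.onnx")
    resources = pplnn.onnx.RuntimeBuilderResources()
    resources.engines = [engine]
    builder.SetResources(resources)
    builder.Preprocess()
    runtime = builder.CreateRuntime()

    # Fill the input with random data, run, and copy the output to host.
    in_tensor = runtime.GetInputTensor(0)
    dims = in_tensor.GetShape().GetDims()
    in_tensor.ConvertFromHost(np.random.rand(*dims).astype(np.float32))
    runtime.Run()
    print(runtime.GetOutputTensor(0).ConvertToHost())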

Highlighted Details

  • Supports LLaMA, ChatGLM, Baichuan, InternLM, Mixtral, Qwen, Falcon, and BigCode models.
  • Offers INT8 groupwise KV-cache quantization and INT8 per-token, per-channel quantization of weights and activations (sketched after this list).
  • Includes Flash Attention and Split-k Attention for improved LLM performance.
  • Supports X86, CUDA, RISCV, and ARM architectures.
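
As a concrete illustration of the per-token, per-channel scheme named above, here is a generic numpy sketch of symmetric INT8 quantization. It demonstrates the technique in general terms and is not PPLNN's implementation.

    # Generic symmetric INT8 quantization, illustrating per-token
    # activation scales and per-channel weight scales (not PPLNN's code).
    import numpy as np

    def quantize_per_token(acts):
        """One scale per row (token) of the activation matrix."""
        scales = np.abs(acts).max(axis=1, keepdims=True) / 127.0
        q = np.clip(np.round(acts / scales), -127, 127).astype(np.int8)
        return q, scales

    def quantize_per_channel(weights):
        """One scale per output column (channel) of the weight matrix."""
        scales = np.abs(weights).max(axis=0, keepdims=True) / 127.0
        q = np.clip(np.round(weights / scales), -127, 127).astype(np.int8)
        return q, scales

    acts = np.random.randn(4, 64).astype(np.float32)  # [tokens, hidden]
    w = np.random.randn(64, 64).astype(np.float32)    # [hidden, out]
    qa, sa = quantize_per_token(acts)
    qw, sw = quantize_per_channel(w)
    # The dequantized INT8 matmul approximates the FP32 result.
    approx = (qa.astype(np.int32) @ qw.astype(np.int32)) * sa * sw
    print(np.abs(approx - acts @ w).max())

Per-token scales track activation outliers at each sequence position, while per-channel scales track magnitude variation across weight columns; combining the two is what keeps accuracy close to FP16.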

Maintenance & Community

  • Active development with a focus on LLM optimizations.
  • Community support via GitHub Issues and a QQ group (627853444).

Licensing & Compatibility

  • Distributed under the Apache License, Version 2.0.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

A known issue with NCCL AllReduce on L40S and H800 GPUs can be mitigated by setting the environment variable NCCL_PROTO=^Simple. ChatGLM1 is no longer supported in the OPMX model format.

Health Check

  • Last commit: 8 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 14 stars in the last 90 days

Explore Similar Projects

  • deepsparse by neuralmagic: CPU inference runtime for sparse deep learning models (3k stars; created 4 years ago, updated 2 months ago)
  • LightLLM by ModelTC: Python framework for LLM inference and serving (3k stars; created 2 years ago, updated 9 hours ago)
  • ktransformers by kvcache-ai: Framework for LLM inference optimization experimentation (15k stars; created 1 year ago, updated 20 hours ago)
  • verl by volcengine: RL training library for LLMs (12k stars; created 9 months ago, updated 3 hours ago)