High-performance inference engine for AI models
PPLNN is a high-performance deep-learning inference engine designed for efficient AI inference, particularly for Large Language Models (LLMs). It supports a wide range of ONNX models, offers enhanced compatibility with the OpenMMLab ecosystem, and targets developers and researchers who need optimized inference performance.
How It Works
PPLNN (short for "PPLNN is a Primitive Library for Neural Network") focuses on efficient execution of ONNX models. It incorporates advanced LLM features such as Flash Attention, Split-k Attention, Group-query Attention, and dynamic batching. The engine also supports Tensor Parallelism and graph optimizations, including INT8 quantization of the KV Cache and of weights/activations, targeting near-FP16 accuracy with a reduced memory footprint and higher throughput.
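To illustrate the INT8 idea, here is a minimal sketch in plain Python with NumPy; it is not PPLNN's API, just the underlying technique: each tensor is mapped to 8-bit integers with a per-tensor scale, trading a small rounding error for half the memory of FP16.

import numpy as np

def quantize_int8(x: np.ndarray):
    # Symmetric per-tensor quantization: map [-max|x|, +max|x|] onto [-127, 127].
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an approximation of the original FP values.
    return q.astype(np.float32) * scale

# Round-trip a mock KV-cache block; the reconstruction error stays small,
# which is why INT8 KV Cache can approach FP16 accuracy at half the memory.
kv = np.random.randn(4, 64).astype(np.float32)
q, s = quantize_int8(kv)
print("max abs error:", np.abs(dequantize_int8(q, s) - kv).max())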
Quick Start & Requirements
Build with x86 support and the Python API enabled:
./build.sh -DPPLNN_USE_X86_64=ON -DPPLNN_ENABLE_PYTHON_API=ON
Dependencies: build-essential, cmake, git, python3, python3-dev (Debian/Ubuntu) or gcc, gcc-c++, cmake3, make, git, python3, python3-devel (RedHat/CentOS).

Run a sample model to verify the build:
PYTHONPATH=./pplnn-build/install/lib python3 ./tools/pplnn.py --use-x86 --onnx-model tests/testdata/conv.onnx
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
A known issue exists with NCCL AllReduce on L40S and H800 GPUs, which can be mitigated by setting NCCL_PROTO=^Simple. ChatGLM1 is no longer supported in the OPMX format.
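For the NCCL workaround, the variable must be set before NCCL is initialized, e.g. in the launching shell or early in the process. A sketch; the variable name and value come from the note above:

import os

# Disable NCCL's Simple protocol before any distributed/NCCL initialization;
# works around the AllReduce issue reported on L40S and H800 GPUs.
os.environ["NCCL_PROTO"] = "^Simple"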