ppl.nn by OpenPPL

Inference engine for efficient AI inferencing

Created 4 years ago
1,355 stars

Top 29.7% on SourcePulse

Project Summary

PPLNN is a high-performance deep-learning inference engine built for efficient AI inferencing, with particular emphasis on Large Language Models (LLMs). It supports various ONNX models, offers enhanced compatibility with the OpenMMLab ecosystem, and targets developers and researchers who need optimized inference performance.

How It Works

PPLNN is built as a primitive library for neural networks, focused on efficient execution of ONNX models. For LLM workloads it incorporates Flash Attention, Split-k Attention, Group-query Attention, and dynamic batching. The engine also supports Tensor Parallelism and graph optimizations, including INT8 quantization of the KV cache and of weights/activations, targeting near-FP16 accuracy with a smaller memory footprint and higher speed.
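
The per-token / per-channel INT8 scheme mentioned above can be illustrated with a short NumPy sketch. This is not PPLNN's kernel code, just a minimal model of the arithmetic under illustrative shapes: activations get one scale per token, weights get one scale per output channel, and the INT32 matmul result is rescaled by the outer product of the two scale vectors.

    import numpy as np

    def quantize_per_token(x: np.ndarray):
        """Symmetric INT8 quantization with one scale per token (row).
        x: activations of shape [num_tokens, hidden_dim]."""
        scale = np.abs(x).max(axis=1, keepdims=True) / 127.0  # [num_tokens, 1]
        scale = np.maximum(scale, 1e-8)                        # avoid divide-by-zero
        q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
        return q, scale

    def quantize_per_channel(w: np.ndarray):
        """Symmetric INT8 quantization with one scale per output channel (column).
        w: weights of shape [hidden_dim, out_features]."""
        scale = np.abs(w).max(axis=0, keepdims=True) / 127.0   # [1, out_features]
        scale = np.maximum(scale, 1e-8)
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    # INT8 matmul followed by dequantization: the integer accumulator is rescaled
    # by the outer product of per-token and per-channel scales.
    x = np.random.randn(4, 64).astype(np.float32)
    w = np.random.randn(64, 128).astype(np.float32)
    qx, sx = quantize_per_token(x)
    qw, sw = quantize_per_channel(w)
    y_approx = (qx.astype(np.int32) @ qw.astype(np.int32)) * (sx * sw)
    print(np.abs(y_approx - x @ w).max())  # small quantization error

In an optimized engine such rescaling is typically fused into the GEMM epilogue; the sketch keeps it explicit for clarity.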

Quick Start & Requirements

  • Install: Clone the repository and build from source using ./build.sh -DPPLNN_USE_X86_64=ON -DPPLNN_ENABLE_PYTHON_API=ON.
  • Prerequisites: build-essential, cmake, git, python3, python3-dev (Debian/Ubuntu) or gcc, gcc-c++, cmake3, make, git, python3, python3-devel (RedHat/CentOS).
  • Demo: PYTHONPATH=./pplnn-build/install/lib python3 ./tools/pplnn.py --use-x86 --onnx-model tests/testdata/conv.onnx (a sketch for generating a small test model of your own follows this list)
  • Docs: Building from Source, API Reference
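
The demo above runs the bundled tests/testdata/conv.onnx. Any small ONNX graph can be used instead; the following sketch builds a one-layer Conv model with the onnx Python package. The onnx dependency, tensor shapes, and file name here are illustrative choices, not part of PPLNN.

    import numpy as np
    import onnx
    from onnx import TensorProto, helper

    # A single-Conv graph, roughly analogous to tests/testdata/conv.onnx.
    x = helper.make_tensor_value_info("input", TensorProto.FLOAT, [1, 3, 8, 8])
    y = helper.make_tensor_value_info("output", TensorProto.FLOAT, [1, 4, 8, 8])
    w = helper.make_tensor(
        "weight", TensorProto.FLOAT, [4, 3, 3, 3],
        np.random.randn(4, 3, 3, 3).astype(np.float32).flatten().tolist(),
    )
    conv = helper.make_node("Conv", ["input", "weight"], ["output"], pads=[1, 1, 1, 1])
    graph = helper.make_graph([conv], "tiny_conv", [x], [y], initializer=[w])
    model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 13)])
    onnx.checker.check_model(model)
    onnx.save(model, "tiny_conv.onnx")

The same demo command can then be pointed at it: PYTHONPATH=./pplnn-build/install/lib python3 ./tools/pplnn.py --use-x86 --onnx-model tiny_conv.onnx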

Highlighted Details

  • Supports LLaMA, ChatGLM, Baichuan, InternLM, Mixtral, Qwen, Falcon, and Bigcode models.
  • Offers INT8 groupwise KV Cache and INT8 per-token per-channel quantization (a minimal KV-cache sketch follows this list).
  • Includes Flash Attention and Split-k Attention for improved LLM performance.
  • Supports X86, CUDA, RISCV, and ARM architectures.
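
As a rough picture of what groupwise KV-cache quantization means, the sketch below quantizes a cached key/value slice to INT8 with one scale per small group of elements along the head dimension. The group size and layout are illustrative assumptions, not PPLNN's actual kernel parameters.

    import numpy as np

    GROUP = 8  # illustrative group size, not PPLNN's actual setting

    def quantize_kv_groupwise(kv: np.ndarray):
        """Quantize a KV-cache slice [num_tokens, head_dim] to INT8,
        one scale per group of GROUP consecutive elements along head_dim."""
        tokens, head_dim = kv.shape
        groups = kv.reshape(tokens, head_dim // GROUP, GROUP)
        scale = np.abs(groups).max(axis=-1, keepdims=True) / 127.0
        scale = np.maximum(scale, 1e-8)
        q = np.clip(np.round(groups / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize_kv(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
        return (q.astype(np.float32) * scale).reshape(q.shape[0], -1)

    kv = np.random.randn(16, 128).astype(np.float32)  # one head, 16 cached tokens
    q, s = quantize_kv_groupwise(kv)
    print(np.abs(dequantize_kv(q, s) - kv).max())     # small reconstruction error

Storing the cache as INT8 plus a small tensor of group scales roughly halves its memory footprint relative to FP16, which is the motivation for the feature.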

Maintenance & Community

  • Active development with a focus on LLM optimizations.
  • Community support via GitHub Issues and a QQ group (627853444).

Licensing & Compatibility

  • Distributed under the Apache License, Version 2.0.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

A known issue exists with NCCL AllReduce on L40S and H800 GPUs, which can be mitigated by setting NCCL_PROTO=^Simple. ChatGLM1 is no longer supported in the OPMX format.
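
The NCCL_PROTO=^Simple workaround is an environment variable, so it must be in place before any NCCL communicator is created. A minimal way to set it from a Python launcher is sketched below (exporting it in the shell before starting the processes works just as well):

    import os

    # Disable NCCL's "Simple" protocol to avoid the AllReduce issue reported
    # on L40S / H800. Must be set before the first NCCL communicator is created.
    os.environ.setdefault("NCCL_PROTO", "^Simple")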

Health Check

  • Last Commit: 9 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 8 stars in the last 30 days
