ppl.nn by OpenPPL

Inference engine for efficient AI inference

created 4 years ago
1,346 stars

Top 30.5% on sourcepulse

View on GitHub
Project Summary

PPLNN is a high-performance deep-learning inference engine with a particular focus on Large Language Models (LLMs). It runs a wide range of ONNX models, offers enhanced compatibility with the OpenMMLab ecosystem, and targets developers and researchers who need optimized inference performance.

How It Works

PPLNN (a recursive acronym for "PPLNN is a Primitive Library for Neural Network") executes ONNX models through a library of optimized primitives. For LLM workloads it adds Flash Attention, Split-k Attention, Group-Query Attention, and dynamic batching, and it supports Tensor Parallelism and graph optimizations. INT8 quantization of the KV cache and of weights/activations targets near-FP16 accuracy with a reduced memory footprint and higher throughput.
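
As a rough illustration of the memory-footprint claim, the sketch below compares FP16 and groupwise INT8 KV-cache sizes. All shapes and the group size are illustrative assumptions, not values taken from PPLNN.

    # Back-of-the-envelope KV-cache sizing: FP16 vs. groupwise INT8.
    # All shapes and the group size below are illustrative assumptions.
    layers, kv_heads, head_dim = 32, 32, 128   # LLaMA-7B-like shape
    batch, seq_len = 8, 4096
    group_size = 8                             # assumed quantization group

    elems = 2 * layers * kv_heads * head_dim * batch * seq_len  # K and V
    fp16_bytes = elems * 2                     # 2 bytes per FP16 value
    # INT8 keeps 1 byte per value plus one FP16 scale per group.
    int8_bytes = elems + (elems // group_size) * 2

    print(f"FP16 KV cache: {fp16_bytes / 2**30:.1f} GiB")  # 16.0 GiB
    print(f"INT8 KV cache: {int8_bytes / 2**30:.1f} GiB")  # 10.0 GiB

With these assumed shapes, groupwise INT8 cuts the cache from 16 GiB to about 10 GiB while keeping one scale per small group of values.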

Quick Start & Requirements

  • Install: Clone the repository and build from source using ./build.sh -DPPLNN_USE_X86_64=ON -DPPLNN_ENABLE_PYTHON_API=ON.
  • Prerequisites: build-essential, cmake, git, python3, python3-dev (Debian/Ubuntu) or gcc, gcc-c++, cmake3, make, git, python3, python3-devel (RedHat/CentOS).
  • Demo: PYTHONPATH=./pplnn-build/install/lib python3 ./tools/pplnn.py --use-x86 --onnx-model tests/testdata/conv.onnx; a Python-API sketch of the same flow follows this list.
  • Docs: Building from Source, API Reference
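
Beyond the CLI demo, tools/pplnn.py drives the engine through the pyppl Python bindings. The sketch below follows the same general flow; the module paths and method names (pyppl.nn, EngineFactory.Create, ConvertFromHost, and so on) are assumptions based on that tool and may differ across PPLNN versions, so check tools/pplnn.py in your checkout before relying on them.

    # Minimal sketch of running an ONNX model through PPLNN's Python
    # bindings. API names below are assumptions based on tools/pplnn.py
    # and may differ between versions; error-status checks are omitted.
    import numpy as np
    from pyppl import nn as pplnn

    # Create an x86 engine (a CUDA engine is created analogously).
    engine = pplnn.x86.EngineFactory.Create(pplnn.x86.EngineOptions())

    # Build a runtime for the ONNX model on top of that engine.
    builder = pplnn.onnx.RuntimeBuilderFactory.Create()
    builder.LoadModelFromFile("tests/testdata/conv.onnx")
    resources = pplnn.onnx.RuntimeBuilderResources()
    resources.engines = [engine]
    builder.SetResources(resources)
    builder.Preprocess()
    runtime = builder.CreateRuntime()

    # Fill the input with random data, run, and copy the output to host.
    in_tensor = runtime.GetInputTensor(0)
    dims = in_tensor.GetShape().GetDims()
    in_tensor.ConvertFromHost(np.random.rand(*dims).astype(np.float32))
    runtime.Run()
    print(runtime.GetOutputTensor(0).ConvertToHost())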

Highlighted Details

  • Supports LLaMA, ChatGLM, Baichuan, InternLM, Mixtral, Qwen, Falcon, and BigCode models.
  • Offers INT8 groupwise KV-cache quantization and INT8 per-token, per-channel quantization of weights and activations (sketched after this list).
  • Includes Flash Attention and Split-k Attention for improved LLM performance.
  • Supports X86, CUDA, RISCV, and ARM architectures.
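
As a concrete illustration of the per-token, per-channel scheme named above, here is a generic numpy sketch of symmetric INT8 quantization. It demonstrates the technique in general terms and is not PPLNN's implementation.

    # Generic symmetric INT8 quantization, illustrating per-token
    # activation scales and per-channel weight scales (not PPLNN's code).
    import numpy as np

    def quantize_per_token(acts):
        """One scale per row (token) of the activation matrix."""
        scales = np.abs(acts).max(axis=1, keepdims=True) / 127.0
        q = np.clip(np.round(acts / scales), -127, 127).astype(np.int8)
        return q, scales

    def quantize_per_channel(weights):
        """One scale per output column (channel) of the weight matrix."""
        scales = np.abs(weights).max(axis=0, keepdims=True) / 127.0
        q = np.clip(np.round(weights / scales), -127, 127).astype(np.int8)
        return q, scales

    acts = np.random.randn(4, 64).astype(np.float32)  # [tokens, hidden]
    w = np.random.randn(64, 64).astype(np.float32)    # [hidden, out]
    qa, sa = quantize_per_token(acts)
    qw, sw = quantize_per_channel(w)
    # The dequantized INT8 matmul approximates the FP32 result.
    approx = (qa.astype(np.int32) @ qw.astype(np.int32)) * sa * sw
    print(np.abs(approx - acts @ w).max())

Per-token scales track activation outliers at each sequence position, while per-channel scales track magnitude variation across weight columns; combining the two is what keeps accuracy close to FP16.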

Maintenance & Community

  • Active development with a focus on LLM optimizations.
  • Community support via GitHub Issues and a QQ group (627853444).

Licensing & Compatibility

  • Distributed under the Apache License, Version 2.0.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

A known issue with NCCL AllReduce on L40S and H800 GPUs can be mitigated by setting the environment variable NCCL_PROTO=^Simple. ChatGLM1 is no longer supported in the OPMX model format.

Health Check

  • Last commit: 8 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 14 stars in the last 90 days

Explore Similar Projects

  • deepsparse by neuralmagic: CPU inference runtime for sparse deep learning models (3k stars; created 4 years ago, updated 2 months ago)
  • LightLLM by ModelTC: Python framework for LLM inference and serving (3k stars; created 2 years ago, updated 9 hours ago)
  • ktransformers by kvcache-ai: Framework for LLM inference optimization experimentation (15k stars; created 1 year ago, updated 20 hours ago)
  • verl by volcengine: RL training library for LLMs (12k stars; created 9 months ago, updated 3 hours ago)