OnnxStream by vitoplantamura

Lightweight C++ inference library for ONNX models, targeting low-resource devices

created 2 years ago
1,971 stars

Top 22.8% on sourcepulse

View on GitHub
Project Summary

OnnxStream is a C++ inference library designed for minimal memory footprint, enabling large AI models like Stable Diffusion XL and Mistral 7B to run on resource-constrained devices such as the Raspberry Pi Zero 2. It supports ARM, x86, WASM, and RISC-V architectures, accelerated by XNNPACK.

How It Works

OnnxStream decouples the inference engine from model weights via a WeightsProvider interface, allowing for flexible data loading (e.g., streaming from HTTP). It employs techniques like attention slicing and dynamic quantization to drastically reduce RAM usage, achieving up to 55x less memory consumption than OnnxRuntime for certain models, albeit with a potential latency increase.
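
As an illustration, the memory/latency trade-offs are exposed as per-model options. The sketch below is modeled on the usage example in the project README; the helper function is hypothetical, and the member and method names (m_use_fp16_arithmetic, m_use_uint8_qdq, set_weights_provider) are assumptions drawn from that example and may differ across versions.

    #include "onnxstream.h"

    using namespace onnxstream;

    // Hypothetical helper: configure a Model for low-RAM inference.
    // Member/method names are assumptions modeled on the README example.
    void configure_low_memory(Model& model)
    {
        // fp16 arithmetic and 8-bit dynamic quantization shrink the
        // working set at the cost of some latency.
        model.m_use_fp16_arithmetic = true;
        model.m_use_uint8_qdq = true;

        // Weights are decoupled from the engine via WeightsProvider, so a
        // custom provider (e.g. one streaming weights over HTTP) could
        // replace the default disk-based one:
        // model.set_weights_provider( /* custom provider */ );
    }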

Quick Start & Requirements

  • Build: Requires CMake and a compatible XNNPACK version (the README pins a specific commit). Build commands for Linux, Mac, Windows, Termux, and FreeBSD are detailed in the README.
  • Dependencies: XNNPACK is a core dependency. GPU support (cuBLAS) is available for LLM applications. WebAssembly builds are also supported.
  • Models: ONNX models must be exported and converted to OnnxStream's text format with the provided onnx2txt.ipynb notebook. Pre-converted models for SD 1.5, SDXL 1.0, and SDXL Turbo are available on Hugging Face; a minimal load-and-run sketch follows this list.
  • Resources: Running SD 1.5 on RPi Zero 2 requires ~260MB RAM (VAE quantized). SDXL 1.0 requires ~300MB RAM with tiled decoding.
  • Docs: onnx2txt.ipynb, Stable Diffusion example, custom model conversion
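
To make the flow concrete, here is a minimal load-and-run sketch in the spirit of the README's usage example. The model path, input name, and shape are placeholders, and every API name used (Model, Tensor, tensor_vector, push_tensor, get_vector) is an assumption taken from that example and may vary between versions.

    #include "onnxstream.h"

    #include <cstdio>
    #include <utility>

    using namespace onnxstream;

    int main()
    {
        Model model;
        model.read_file("/path/to/model.txt"); // produced by onnx2txt.ipynb

        // Single input with batch size 1 (the only batch size supported).
        // Placeholder name and shape; use your model's actual values.
        tensor_vector<float> input(1 * 4 * 64 * 64, 0.0f);
        Tensor t;
        t.m_name = "input";
        t.m_shape = {1, 4, 64, 64};
        t.set_vector(std::move(input));
        model.push_tensor(std::move(t));

        model.run();

        // Outputs are collected in model.m_data after run().
        auto& out = model.m_data[0];
        auto& values = out.get_vector<float>();
        std::printf("output '%s' has %zu elements\n",
                    out.m_name.c_str(), values.size());
        return 0;
    }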

Highlighted Details

  • Achieves ~300MB RAM usage for SDXL 1.0 on RPi Zero 2 via tiled VAE decoding.
  • Supports dynamic quantization (8-bit) and attention slicing for memory reduction.
  • Offers WebAssembly builds for browser-based inference (e.g., Whisper, YOLOv8).
  • Initial GPU support via cuBLAS for LLM applications.

Maintenance & Community

  • Active development with recent updates for WebAssembly support and LLM applications.
  • Related projects include OnnxStreamGui, Auto epaper art, and PaperPiAI.

Licensing & Compatibility

  • The README does not explicitly state a license for vitoplantamura/OnnxStream; check the repository's LICENSE file before use. XNNPACK, the core dependency, is distributed under its own permissive license.

Limitations & Caveats

  • OnnxStream does not support inputs with a batch size other than 1.
  • Some ONNX operators might not be implemented yet (e.g., Einsum is noted as unsupported for custom SD 1.5 models).
  • The MAX_SPEED build option can increase memory usage during compilation and may cause issues on some platforms like Termux.
  • FreeBSD build requires specific CMake modifications due to XNNPACK limitations.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 38 stars in the last 90 days
