tensorrt-cpp-api by cyrusbehr

High-performance GPU inference C++ library

Created 4 years ago

807 stars

Top 43.0% on SourcePulse

Project Summary

A modern, no-throw C++ library for high-performance GPU inference using NVIDIA TensorRT. It simplifies ONNX model compilation into optimized TensorRT engines, offering a clean C++ API with name-keyed tensors, caller-owned CUDA streams, explicit host/device transfers, and a robust Status/Result<T> error model. The library targets developers seeking efficient deep learning inference, with optional zero-copy Python bindings for seamless integration with Python ML ecosystems.
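To make the Status/Result<T> idea concrete, here is a minimal sketch of a no-throw error model in standard C++20. Every name in it (StatusCode, Status, Result, parse_batch_size) is hypothetical and chosen for illustration; the library's actual types and signatures may differ.

```cpp
#include <cassert>
#include <charconv>
#include <string>
#include <string_view>
#include <system_error>
#include <utility>
#include <variant>

// Hypothetical names throughout: this is NOT the library's actual API.
enum class StatusCode { Ok, InvalidArgument };

struct Status {
    StatusCode code = StatusCode::Ok;
    std::string message;
    bool ok() const noexcept { return code == StatusCode::Ok; }
};

// Holds either a value of type T or the Status describing why there is none.
template <typename T>
class Result {
public:
    Result(T value) : data_(std::move(value)) {}
    Result(Status error) : data_(std::move(error)) {}

    bool ok() const noexcept { return std::holds_alternative<T>(data_); }

    // Precondition: ok(). Checked with assert instead of throwing.
    const T& value() const noexcept {
        assert(ok());
        return *std::get_if<T>(&data_);
    }

    // Returns the stored error, or an Ok status when a value is present.
    Status status() const {
        if (const Status* s = std::get_if<Status>(&data_)) return *s;
        return Status{};
    }

private:
    std::variant<T, Status> data_;
};

// Example producer: reports failures through the return value, not exceptions.
Result<int> parse_batch_size(std::string_view text) {
    int value = 0;
    const char* first = text.data();
    const char* last = first + text.size();
    const auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc{} || ptr != last) {
        return Status{StatusCode::InvalidArgument, "batch size is not an integer"};
    }
    if (value <= 0) {
        return Status{StatusCode::InvalidArgument, "batch size must be positive"};
    }
    return value;
}
```

A caller would check ok() before touching value(), and otherwise log or propagate status(); the point of the pattern is that every failure path is visible in the function signature.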

How It Works

The library abstracts TensorRT and CUDA complexities, providing an EngineBuilder for ONNX model compilation. A key feature is its robust engine caching mechanism, keyed by ONNX content hash, build options, TensorRT version, and GPU UUID, preventing silent misuse of stale caches. It supports dynamic shapes via per-input optimization profiles and enables thread-safe, multi-stream inference using an EnginePool, with each inference call managed by a caller-provided Stream. Public headers are deliberately free of TensorRT, OpenCV, or spdlog types, employing PImpl and generated headers for API purity. Optional fused preprocessing kernels and zero-copy Python bindings leverage __cuda_array_interface__ or DLPack for direct GPU memory access, releasing the GIL during inference.
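The caching scheme described above can be sketched as a key derived from all four inputs, so that changing any one of them yields a different cache file. This is an illustration only: the struct, function names, and file-name format are assumptions, and FNV-1a is used purely to keep the example self-contained (the summary does not say which hash the library uses; a real cache would more plausibly use a cryptographic hash such as SHA-256).

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>
#include <string_view>

// Illustrative only: the library's real key format is not documented here.
struct CacheKeyInputs {
    std::string onnx_bytes;        // raw contents of the ONNX file
    std::string build_options;     // canonical text form of the build options
    std::string tensorrt_version;  // e.g. "10.0.1"
    std::string gpu_uuid;          // UUID of the target GPU
};

constexpr std::uint64_t kFnvOffset = 0xcbf29ce484222325ULL;  // FNV-1a 64-bit offset basis
constexpr std::uint64_t kFnvPrime = 0x100000001b3ULL;        // FNV-1a 64-bit prime

// Folds one field into the running hash. The length is mixed in first so that
// moving bytes across a field boundary changes the input to the hash.
std::uint64_t mix_field(std::uint64_t hash, std::string_view field) {
    const std::uint64_t size = field.size();
    for (int shift = 0; shift < 64; shift += 8) {
        hash ^= (size >> shift) & 0xffu;
        hash *= kFnvPrime;
    }
    for (unsigned char c : field) {
        hash ^= c;
        hash *= kFnvPrime;
    }
    return hash;
}

// Builds a cache file name such as "0123456789abcdef.engine".
std::string cache_file_name(const CacheKeyInputs& in) {
    std::uint64_t h = kFnvOffset;
    h = mix_field(h, in.onnx_bytes);
    h = mix_field(h, in.build_options);
    h = mix_field(h, in.tensorrt_version);
    h = mix_field(h, in.gpu_uuid);

    char buf[17];
    const int written = std::snprintf(buf, sizeof buf, "%016llx",
                                      static_cast<unsigned long long>(h));
    assert(written == 16);
    return std::string(buf) + ".engine";
}
```

With a key like this, a lookup for an engine built under a different TensorRT version or on a different GPU simply misses and triggers a rebuild, rather than loading an incompatible file.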

Quick Start & Requirements

Primary install / run command:
- C++: cmake --install build --prefix /opt/trtcpp followed by find_package(tensorrt_cpp_api REQUIRED) in downstream projects.
- Python: pip install . (builds wheel via scikit-build-core).
Non-default prerequisites and dependencies: TensorRT (≥ 10), CUDA (12), C++20, Linux.
Links: docs/install.md for installation details; examples/ directory for runnable reference programs.

Highlighted Details

Safe Engine Cache: Rebuilds stale engines automatically if the model, build options, TensorRT version, or GPU changes.
Dynamic Shapes: Supports per-input min/opt/max optimization profiles and per-execution context optimization.
Zero-Copy Python Bindings: Integrates with CuPy/PyTorch/Numba via __cuda_array_interface__ or DLPack, minimizing host round-trips and releasing the GIL.
Fused Preprocessing: Optional single CUDA kernel for common image transformations (letterbox-resize, color space conversion, normalization, layout change).
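As an illustration of the letterbox-resize step mentioned above, the following host-side sketch computes the geometry such a kernel needs: a uniform scale that preserves aspect ratio, plus the padding that fills the rest of the target. The struct and function names are hypothetical, and centered padding is an assumption; the library's kernel may place the image differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper, not part of the library's API.
struct LetterboxParams {
    float scale;   // uniform scale applied to the source image
    int new_w;     // width of the resized image inside the target
    int new_h;     // height of the resized image inside the target
    int pad_left;  // columns of padding to the left of the image
    int pad_top;   // rows of padding above the image
};

// Computes the largest aspect-preserving resize that fits dst_w x dst_h,
// centering the result (assumption) and leaving the remainder as padding.
LetterboxParams compute_letterbox(int src_w, int src_h, int dst_w, int dst_h) {
    assert(src_w > 0 && src_h > 0 && dst_w > 0 && dst_h > 0);
    const float scale = std::min(static_cast<float>(dst_w) / static_cast<float>(src_w),
                                 static_cast<float>(dst_h) / static_cast<float>(src_h));
    // Clamp so rounding can never exceed the target size.
    const int new_w = std::min(dst_w, static_cast<int>(std::lround(src_w * scale)));
    const int new_h = std::min(dst_h, static_cast<int>(std::lround(src_h * scale)));
    return LetterboxParams{scale, new_w, new_h, (dst_w - new_w) / 2, (dst_h - new_h) / 2};
}
```

For a 1280x720 frame and a 640x640 network input, this gives a scale of 0.5, a 640x360 image, and 140 rows of padding above it. The same scale and padding values are what a caller would use to map predicted box coordinates back to the original frame.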

Maintenance & Community

Contributors: Loic Tetrel, thomaskleiven, WiCyn.
Community: Connect on LinkedIn.
Development: Uses pre-commit hooks for formatting; CI includes build, CPU tests, sanitizers, and Python wheel builds.

Licensing & Compatibility

License type: Refer to the LICENSE file for specifics.
Compatibility notes: Primarily targets Linux, CUDA 12, and TensorRT ≥ 10. Windows support and LLM/transformer-specific features are out of scope.

Limitations & Caveats

The project is scoped to Linux, CUDA 12, TensorRT ≥ 10, and CNN-style vision models. Windows support and LLM/transformer inference are explicitly out of scope.

Health Check

Last Commit

1 month ago

Responsiveness

Inactive

Star History

2 stars in the last 30 days