C/C++ inference engine for Vision Transformer (ViT) models
This project provides a C/C++ inference engine for Vision Transformer (ViT) models, leveraging the ggml library for optimized performance on edge devices. It targets developers and researchers needing a lightweight, dependency-free solution for ViT inference, offering significantly faster startup times and lower memory footprints compared to traditional deep learning frameworks.
How It Works
The implementation is a direct C/C++ translation of the ViT architecture, using ggml for efficient tensor operations and memory management. This approach enables aggressive quantization (4-bit, 5-bit, and 8-bit) and per-device optimization via compiler flags such as -march=native, leading to substantial speedups and reduced memory usage, particularly on CPUs.
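To make the ggml workflow concrete, the sketch below shows the pattern the engine builds on: allocate one memory arena, declare tensors, construct a static compute graph, and evaluate it on the CPU. This is an illustration only, not code from the project: the single matrix multiply stands in for one ViT projection, the shapes are arbitrary (ViT-B/16 sizes), and the graph functions follow a recent ggml release, so names may differ in the ggml revision this repository vendors.

```cpp
// Minimal ggml sketch (assumed API of a recent ggml release, not this project's code).
#include <cstdio>
#include "ggml.h"

int main() {
    // One up-front allocation; ggml performs no hidden allocations afterwards.
    ggml_init_params params = {};
    params.mem_size   = 64 * 1024 * 1024;
    params.mem_buffer = nullptr;
    params.no_alloc   = false;

    ggml_context * ctx = ggml_init(params);

    // A weight matrix and an input of 197 patch/token embeddings (ViT-B/16 sizes).
    ggml_tensor * w = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 768, 768);
    ggml_tensor * x = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 768, 197);

    // y = w^T * x: one node of the forward graph, standing in for a ViT projection.
    ggml_tensor * y = ggml_mul_mat(ctx, w, x);

    // Build the graph from the output node and evaluate it with 4 threads.
    ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, y);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/4);

    std::printf("output: %lld x %lld\n", (long long) y->ne[0], (long long) y->ne[1]);

    ggml_free(ctx);
    return 0;
}
```

In a full engine the same pattern is repeated across all transformer layers, with quantized weight types (for example GGML_TYPE_Q4_0 instead of GGML_TYPE_F32) providing the memory savings described above.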
Quick Start & Requirements
Clone the repository with submodules (git clone --recurse-submodules), install the Python dependencies (pip install torch timm), convert a PyTorch checkpoint to GGUF format with convert-pth-to-ggml.py, and build the C++ inference engine (mkdir build && cd build && cmake .. && make -j4). Run inference with ./bin/vit -t <threads> -m <model_path.gguf> -i <image_path.jpeg>.
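After the conversion step, one quick sanity check is to read the GGUF header of the produced file. The snippet below is a hypothetical standalone helper, not part of the project; it assumes the GGUF v2+ header layout (a 4-byte "GGUF" magic, a uint32 version, then uint64 tensor and metadata key-value counts).

```cpp
// Hypothetical GGUF header check (assumes GGUF v2+ layout; not project code).
#include <cstdint>
#include <cstring>
#include <fstream>
#include <iostream>

int main(int argc, char ** argv) {
    if (argc < 2) {
        std::cerr << "usage: " << argv[0] << " <model_path.gguf>\n";
        return 1;
    }

    std::ifstream f(argv[1], std::ios::binary);
    if (!f) {
        std::cerr << "cannot open " << argv[1] << "\n";
        return 1;
    }

    char     magic[4];
    uint32_t version   = 0;
    uint64_t n_tensors = 0;
    uint64_t n_kv      = 0;

    // Header fields in file order: magic, version, tensor count, metadata KV count.
    f.read(magic, 4);
    f.read(reinterpret_cast<char *>(&version),   sizeof(version));
    f.read(reinterpret_cast<char *>(&n_tensors), sizeof(n_tensors));
    f.read(reinterpret_cast<char *>(&n_kv),      sizeof(n_kv));

    if (!f || std::memcmp(magic, "GGUF", 4) != 0) {
        std::cerr << "not a GGUF file\n";
        return 1;
    }

    std::cout << "GGUF v" << version
              << ", tensors: " << n_tensors
              << ", metadata entries: " << n_kv << "\n";
    return 0;
}
```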
Highlighted Details
Maintenance & Community
The project is inspired by whisper.cpp and llama.cpp, suggesting a community familiar with these highly successful projects. No community links (Discord/Slack) or active maintainer information are provided in the README.
Licensing & Compatibility
The repository does not explicitly state a license in the README; licensing should be verified before commercial use or integration into closed-source projects.
Limitations & Caveats
The README does not specify which ViT variants are supported beyond "timm ViTs with different variants out of the box." Evaluation on standard datasets such as ImageNet-1k is still listed as a to-do, indicating ongoing development and a remaining need for accuracy validation.