C++ inference engine for Google's Gemma models
This C++ inference engine provides a lightweight, standalone solution for running Google's Gemma, RecurrentGemma, and PaliGemma models. It targets researchers and developers who need direct control over LLM computation, offering a minimalist implementation (~2K LoC core) inspired by projects like ggml and llama.c. The engine leverages portable SIMD via the Google Highway library for efficient CPU inference.
How It Works
The engine implements Gemma models using a direct C++ approach, avoiding the abstraction layers common in Python frameworks. It uses the Google Highway library for SIMD acceleration, enabling efficient CPU-bound inference. Model weights can be loaded in several formats, including 8-bit switched floating point (`-sfp`) for reduced memory use and faster inference, or `bfloat16` for higher fidelity.
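The performance-critical loops in such an engine are plain SIMD kernels. The sketch below is illustrative rather than gemma.cpp's actual source: it shows the Highway pattern a CPU inference engine relies on, a fused multiply-add dot product over the widest vector the target supports, plus a `bfloat16` decode helper. The function names `Bf16ToF32` and `DotProduct` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

// bfloat16 keeps the upper 16 bits of an IEEE-754 float32, so decoding
// is just a shift back into the high half of a 32-bit word.
inline float Bf16ToF32(uint16_t bits) {
  const uint32_t u32 = static_cast<uint32_t>(bits) << 16;
  float f;
  std::memcpy(&f, &u32, sizeof(f));
  return f;
}

// Vectorized dot product: the kind of inner loop that dominates
// transformer inference (activations x one weight row). For brevity,
// `size` is assumed to be a multiple of the vector width.
float DotProduct(const float* HWY_RESTRICT a, const float* HWY_RESTRICT b,
                 size_t size) {
  const hn::ScalableTag<float> d;  // widest float vector on this target
  auto sum = hn::Zero(d);
  for (size_t i = 0; i < size; i += hn::Lanes(d)) {
    // sum += a[i..] * b[i..], one vector of lanes at a time.
    sum = hn::MulAdd(hn::Load(d, a + i), hn::Load(d, b + i), sum);
  }
  // Horizontal reduction: add all lanes and extract the scalar result.
  return hn::GetLane(hn::SumOfLanes(d, sum));
}
```

Because Highway dispatches to the widest instruction set available (SSE4, AVX2, AVX-512, NEON, ...), the same source runs efficiently across CPUs without per-architecture intrinsics.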
Quick Start & Requirements
- Build with CMake: `cmake -B build && cmake --build --preset make`. Windows requires Visual Studio 2022 Build Tools with LLVM/Clang.
- Download the model weights and extract the `.tar` archive, choosing a variant such as `gemma-2b-it-sfp`.
- Run: `./gemma --tokenizer <tokenizer_file> --weights <weights_file> --model <model_name>`
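For instance, with the `gemma-2b-it-sfp` archive extracted into the working directory, an invocation could look like the following (filenames are illustrative of the archive's layout):

```
./gemma --tokenizer tokenizer.spm --weights 2b-it-sfp.sbs --model 2b-it
```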
Highlighted Details
- Supports `-sfp` (8-bit switched floating point) weights for performance.
- Can be embedded in other CMake projects via `FetchContent`, as sketched below.
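For the `FetchContent` route, a downstream `CMakeLists.txt` might look like this sketch; `my_app` is a placeholder target, and the linked library names (`libgemma`, `hwy`, `hwy_contrib`, `sentencepiece`) are assumptions about the project's exported targets.

```cmake
include(FetchContent)

# Fetch gemma.cpp at configure time and make its targets available.
FetchContent_Declare(gemma
  GIT_REPOSITORY https://github.com/google/gemma.cpp
  GIT_TAG origin/main)
FetchContent_MakeAvailable(gemma)

add_executable(my_app main.cc)
target_link_libraries(my_app PRIVATE libgemma hwy hwy_contrib sentencepiece)
```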
Maintenance & Community
Active development happens on the `dev` branch.
Licensing & Compatibility
The code is released under the Apache 2.0 license; use of the Gemma model weights is governed by Google's Gemma Terms of Use.
Limitations & Caveats
The engine targets CPU inference and is aimed at experimentation and research rather than production serving.