google/gemma.cpp: C++ inference engine for Google's Gemma models
Top 7.7% on SourcePulse
This C++ inference engine provides a lightweight, standalone solution for running Google's Gemma, RecurrentGemma, and PaliGemma models. It targets researchers and developers needing direct control over LLM computation, offering a minimalist implementation (~2K LoC core) inspired by projects like ggml and llama.c. The engine leverages portable SIMD via the Google Highway Library for efficient CPU inference.
How It Works
The engine implements Gemma models in direct C++, avoiding the abstraction layers common in Python frameworks. It uses the Google Highway Library for SIMD acceleration, enabling efficient inference on CPUs. Model weights can be loaded in several formats, including 8-bit switched floating point (-sfp) for reduced memory use and faster inference, and bfloat16 for higher fidelity.
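To make the SIMD point concrete, the sketch below (illustrative, not code from the gemma.cpp repository) shows the Highway pattern such kernels follow: a dot product written once against Highway's portable API, which compiles to SSE/AVX, NEON, or SVE depending on the target. The function name is hypothetical, and static dispatch to the baseline target is assumed for brevity.

```cpp
// Minimal Highway sketch: a portable SIMD dot product (illustrative only).
#include <cstddef>

#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

float DotProduct(const float* HWY_RESTRICT a, const float* HWY_RESTRICT b,
                 size_t n) {
  const hn::ScalableTag<float> d;  // lane count chosen by the target ISA
  const size_t lanes = hn::Lanes(d);
  auto acc = hn::Zero(d);
  size_t i = 0;
  // Vector body: fused multiply-add across full vectors.
  for (; i + lanes <= n; i += lanes) {
    acc = hn::MulAdd(hn::Load(d, a + i), hn::Load(d, b + i), acc);
  }
  // Horizontal reduction of the accumulator, then a scalar tail.
  float sum = hn::ReduceSum(d, acc);
  for (; i < n; ++i) sum += a[i] * b[i];
  return sum;
}
```

Writing kernels once against this API and letting Highway pick the instruction set is what lets the core stay small while remaining fast across x86 and Arm.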
Quick Start & Requirements
- Build with CMake: cmake -B build && cmake --build --preset make
- Download model weights and extract the .tar archive; the 8-bit gemma-2b-it-sfp variant is the suggested starting point.
- Windows requires Visual Studio 2022 Build Tools with LLVM/Clang.
- Run: ./gemma --tokenizer <tokenizer_file> --weights <weights_file> --model <model_name>
Highlighted Details
- Supports -sfp (8-bit switched floating point) weights for performance.
- Can be embedded in other CMake projects via FetchContent, as sketched below.
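A minimal sketch of the FetchContent route follows. The repository URL is the google/gemma.cpp GitHub repo; the consumer target my_app is hypothetical, and the libgemma library target name is an assumption to verify against the project's current CMakeLists.

```cmake
# Sketch: consuming gemma.cpp from another CMake project via FetchContent.
# The `libgemma` target name is an assumption -- check the project's CMakeLists.
include(FetchContent)

FetchContent_Declare(gemma
  GIT_REPOSITORY https://github.com/google/gemma.cpp
  GIT_TAG        origin/main)
FetchContent_MakeAvailable(gemma)

add_executable(my_app main.cc)  # hypothetical consumer target
target_link_libraries(my_app PRIVATE libgemma)
```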
Maintenance & Community
- Active development happens on the dev branch.

Licensing & Compatibility
- The code is licensed under Apache 2.0; Gemma model weights are governed by Google's separate Gemma terms of use.
Limitations & Caveats
- As a minimalist, CPU-focused engine, it is geared toward experimentation and research rather than production serving.