C++ inference engine for Google's Gemma models
This C++ inference engine provides a lightweight, standalone solution for running Google's Gemma, RecurrentGemma, and PaliGemma models. It targets researchers and developers who need direct control over LLM computation, offering a minimalist implementation (~2K LoC core) inspired by projects like ggml and llama.c. The engine leverages portable SIMD via the Google Highway library for efficient CPU inference.
How It Works
The engine implements Gemma models using a direct C++ approach, avoiding the abstraction layers common in Python frameworks. It uses the Google Highway library for SIMD acceleration, enabling efficient CPU-bound inference. Model weights can be loaded in several formats, including 8-bit switched floating point (`-sfp`) for reduced memory use and faster inference, or `bfloat16` for higher fidelity.
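The performance-critical loops in such an engine are plain SIMD kernels. The sketch below is illustrative rather than gemma.cpp's actual source: it shows the Highway pattern a CPU inference engine relies on, a fused multiply-add dot product over the widest vector the target supports, plus a `bfloat16` decode helper. The function names `Bf16ToF32` and `DotProduct` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

// bfloat16 keeps the upper 16 bits of an IEEE-754 float32, so decoding
// is just a shift back into the high half of a 32-bit word.
inline float Bf16ToF32(uint16_t bits) {
  const uint32_t u32 = static_cast<uint32_t>(bits) << 16;
  float f;
  std::memcpy(&f, &u32, sizeof(f));
  return f;
}

// Vectorized dot product: the kind of inner loop that dominates
// transformer inference (activations x one weight row). For brevity,
// `size` is assumed to be a multiple of the vector width.
float DotProduct(const float* HWY_RESTRICT a, const float* HWY_RESTRICT b,
                 size_t size) {
  const hn::ScalableTag<float> d;  // widest float vector on this target
  auto sum = hn::Zero(d);
  for (size_t i = 0; i < size; i += hn::Lanes(d)) {
    // sum += a[i..] * b[i..], one vector of lanes at a time.
    sum = hn::MulAdd(hn::Load(d, a + i), hn::Load(d, b + i), sum);
  }
  // Horizontal reduction: add all lanes and extract the scalar result.
  return hn::GetLane(hn::SumOfLanes(d, sum));
}
```

Because Highway dispatches to the widest instruction set available (SSE4, AVX2, AVX-512, NEON, ...), the same source runs efficiently across CPUs without per-architecture intrinsics.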
Quick Start & Requirements
- Build with CMake: `cmake -B build && cmake --build --preset make`. Windows requires Visual Studio 2022 Build Tools with LLVM/Clang.
- Download the model weights and extract the `.tar` archive, choosing a variant such as `gemma-2b-it-sfp`.
- Run: `./gemma --tokenizer <tokenizer_file> --weights <weights_file> --model <model_name>`
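For instance, with the `gemma-2b-it-sfp` archive extracted into the working directory, an invocation could look like the following (filenames are illustrative of the archive's layout):

```
./gemma --tokenizer tokenizer.spm --weights 2b-it-sfp.sbs --model 2b-it
```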
Highlighted Details
- Supports `-sfp` (8-bit switched floating point) weights for performance.
- Can be embedded in other CMake projects via `FetchContent`, as sketched below.
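For the `FetchContent` route, a downstream `CMakeLists.txt` might look like this sketch; `my_app` is a placeholder target, and the linked library names (`libgemma`, `hwy`, `hwy_contrib`, `sentencepiece`) are assumptions about the project's exported targets.

```cmake
include(FetchContent)

# Fetch gemma.cpp at configure time and make its targets available.
FetchContent_Declare(gemma
  GIT_REPOSITORY https://github.com/google/gemma.cpp
  GIT_TAG origin/main)
FetchContent_MakeAvailable(gemma)

add_executable(my_app main.cc)
target_link_libraries(my_app PRIVATE libgemma hwy hwy_contrib sentencepiece)
```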
Maintenance & Community
Active development happens on the `dev` branch.
Licensing & Compatibility
The code is released under the Apache 2.0 license; use of the Gemma model weights is governed by Google's Gemma Terms of Use.
Limitations & Caveats
The engine targets CPU inference and is aimed at experimentation and research rather than production serving.