gemma.cpp by google

C++ inference engine for Google's Gemma models

Created 1 year ago
6,566 stars

Top 7.8% on SourcePulse

Project Summary

This C++ inference engine provides a lightweight, standalone solution for running Google's Gemma, RecurrentGemma, and PaliGemma models. It targets researchers and developers needing direct control over LLM computation, offering a minimalist implementation (~2K LoC core) inspired by projects like ggml and llama.c. The engine leverages portable SIMD via the Google Highway Library for efficient CPU inference.

How It Works

The engine implements Gemma models directly in C++, avoiding the abstraction layers common in Python frameworks. It uses the Google Highway Library for portable SIMD acceleration, enabling efficient CPU inference; a sketch of this SIMD style appears below. Model weights are available in several formats, including 8-bit switched floating point (-sfp) for reduced memory use and faster inference, and bfloat16 for higher fidelity.
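
To make the SIMD style concrete, here is a minimal sketch of a fused multiply-add loop in Highway's portable idiom, adapted from the library's documented example. The function name is illustrative and this is not code from gemma.cpp itself.

    // A minimal sketch of Highway's portable-SIMD idiom (static dispatch,
    // adapted from Highway's documented example; not gemma.cpp's own code).
    // Computes x[i] = mul[i] * x[i] + add[i]; assumes size is a multiple of
    // the vector lane count for brevity.
    #include <cstddef>
    #include "hwy/highway.h"

    namespace hn = hwy::HWY_NAMESPACE;

    void MulAddLoop(const float* HWY_RESTRICT mul, const float* HWY_RESTRICT add,
                    size_t size, float* HWY_RESTRICT x) {
      const hn::ScalableTag<float> d;  // full-width float vector for this target
      for (size_t i = 0; i < size; i += hn::Lanes(d)) {
        const auto mul_v = hn::Load(d, mul + i);
        const auto add_v = hn::Load(d, add + i);
        auto x_v = hn::Load(d, x + i);
        x_v = hn::MulAdd(mul_v, x_v, add_v);  // fused multiply-add per lane
        hn::Store(x_v, d, x + i);
      }
    }

The same loop compiles to SSE4/AVX2/AVX-512 on x86 or NEON/SVE on Arm, which is what lets a single small C++ core stay portable across CPUs.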

Quick Start & Requirements

  • Install: Configure and build with CMake presets (cmake --preset make && cmake --build --preset make).
  • Prerequisites: CMake, a Clang C++ compiler supporting C++17, and tar. Windows requires Visual Studio 2022 Build Tools with the LLVM/Clang toolchain.
  • Model Weights: Download from Kaggle or Hugging Face Hub (e.g., gemma-2b-it-sfp).
  • Run: ./gemma --tokenizer <tokenizer_file> --weights <weights_file> --model <model_name>
  • Docs: ai.google.dev/gemma

Highlighted Details

  • Supports Gemma (1, 2, 3), RecurrentGemma, and PaliGemma (VLM) models.
  • Offers -sfp (8-bit switched floating point) weights for performance.
  • Can be incorporated as a library in CMake projects using FetchContent.
  • Includes a tool for migrating weights to a single-file format.

Maintenance & Community

  • Active development on the dev branch.
  • Community contributions welcome; Discord server available.
  • Key contributors include Austin Huang, Jan Wassenberg, Phil Culliton, Paul Chang, and Dan Zheng.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • On Windows, building under WSL is recommended; native Windows support is still being explored.
  • CLI usage is experimental and may have context length limitations.
  • Image reading for PaliGemma is basic, currently supporting only the binary PPM (P6) format (see the sketch below).
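
Binary PPM's header is simple enough that a reader fits in a dozen lines, which also makes converting other formats to P6 (e.g., with ImageMagick) a workable interim path. Below is a hypothetical standalone sketch, not gemma.cpp's actual reader, assuming a maxval of 255 and no '#' comment lines in the header:

    // Hypothetical minimal reader for binary PPM (P6), the only image format
    // the current PaliGemma path accepts. Not gemma.cpp code.
    #include <cstdint>
    #include <fstream>
    #include <string>
    #include <vector>

    struct Image {
      int width = 0, height = 0;
      std::vector<uint8_t> rgb;  // width * height * 3 bytes, row-major
    };

    // Returns false on malformed input. Assumes maxval == 255 and no '#'
    // comments, which covers typical converter output.
    bool ReadPpmP6(const std::string& path, Image& out) {
      std::ifstream f(path, std::ios::binary);
      std::string magic;
      int maxval = 0;
      if (!(f >> magic >> out.width >> out.height >> maxval)) return false;
      if (magic != "P6" || maxval != 255) return false;
      f.get();  // consume the single whitespace byte after the header
      out.rgb.resize(static_cast<size_t>(out.width) * out.height * 3);
      f.read(reinterpret_cast<char*>(out.rgb.data()), out.rgb.size());
      return static_cast<size_t>(f.gcount()) == out.rgb.size();
    }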

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull requests (30d): 48
  • Issues (30d): 3
  • Star history: 39 stars in the last 30 days

Explore Similar Projects

dots.llm1 by rednote-hilab

  • MoE model for research
  • 462 stars (top 0.2% on SourcePulse)
  • Created 4 months ago; updated 4 weeks ago
  • Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI)

local-gemma by huggingface

  • CLI tool for local Gemma-2 inference
  • 376 stars (top 0.3% on SourcePulse)
  • Created 1 year ago; updated 1 year ago
  • Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Omar Sanseviero (DevRel at Google DeepMind), and 2 more

EAGLE by SafeAILab

  • Speculative decoding research paper for faster LLM inference
  • 2k stars (top 10.6% on SourcePulse)
  • Created 1 year ago; updated 1 week ago
  • Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more

gemma_pytorch by google

  • PyTorch implementation for Google's Gemma models
  • 6k stars (top 0.2% on SourcePulse)
  • Created 1 year ago; updated 3 months ago
  • Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Jeff Hammerbacher (Cofounder of Cloudera), and 4 more