CLI tool for local Gemma-2 inference
This repository provides a streamlined method for running Google's Gemma-2 large language models locally on various hardware, targeting developers and researchers who need efficient on-device inference. It offers significant flexibility in balancing performance, memory usage, and accuracy through configurable presets.
How It Works
The project leverages the 🤗 Transformers and bitsandbytes libraries to enable local execution of Gemma-2 models. It supports multiple hardware backends (CUDA, MPS, CPU) and offers optimization presets: `exact` for maximum accuracy, `speed` for enhanced throughput via `torch.compile` (CUDA only), `memory` for 4-bit quantization, and `memory_extreme` for a minimal memory footprint with CPU offloading. This approach lets users tailor the model's resource consumption and speed to their specific hardware and use case.
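As a rough illustration of what the `memory` preset implies, the sketch below loads a Gemma-2 checkpoint in 4-bit directly with 🤗 Transformers and bitsandbytes; the model ID and quantization settings are illustrative assumptions rather than the project's exact internals.

```python
# Minimal sketch: 4-bit loading with Transformers + bitsandbytes,
# approximating the `memory` preset. Settings are assumptions,
# not the project's exact configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/gemma-2-9b-it"  # gated: requires accepting the Gemma-2 terms

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # spread layers across available devices
)

inputs = tokenizer("Why is the sky blue?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```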
Quick Start & Requirements
- Install: `pipx install local-gemma"[cuda]"` (or `"[mps]"` or `"[cpu]"`) for the CLI, or `pip install local-gemma"[cuda]"` (or `"[mps]"` or `"[cpu]"`) for the Python API.
- A CUDA GPU is required for the `speed` preset.
- A Hugging Face read token is required for model download.
- Memory requirements depend on model size and preset, ranging from the smallest configuration (`2b`, `memory_extreme`) up to 54.6GB (`27b`, `exact`).
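A minimal sketch of what preset selection through the Python API could look like; the class name `LocalGemma2ForCausalLM` and the `preset` keyword are assumptions inferred from this description, not verified signatures.

```python
# Hypothetical usage sketch of the Python API; class and argument names
# (`LocalGemma2ForCausalLM`, `preset=`) are assumptions, not verified.
from transformers import AutoTokenizer
from local_gemma import LocalGemma2ForCausalLM  # assumed import path

model = LocalGemma2ForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
    preset="memory",  # one of: "exact", "speed", "memory", "memory_extreme"
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")

inputs = tokenizer("Summarise the Gemma-2 architecture.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```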
Highlighted Details
- The `speed` preset offers up to 6x faster generation on CUDA via `torch.compile` (see the sketch below).
- The `memory_extreme` preset reduces memory usage significantly, enabling large models on consumer hardware (e.g., 3.7GB for the 9b model).
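The kind of speedup the `speed` preset targets can be approximated with stock 🤗 Transformers; the hedged sketch below uses the generic static-KV-cache plus `torch.compile` recipe, which may differ from the project's actual implementation.

```python
# Sketch of the torch.compile pattern behind CUDA speed gains; this is the
# generic Transformers recipe, not necessarily the project's exact `speed` preset.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-9b-it"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

# Static KV cache + compiled forward pass avoid recompilation across decode steps.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("Write a haiku about GPUs.", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```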
Maintenance & Community
The project is built upon key libraries like Transformers, bitsandbytes, Quanto, and Accelerate. Specific contributors are not highlighted beyond acknowledgements.
Licensing & Compatibility
The project itself appears to be under a permissive license, but it relies on Gemma-2 models, which have their own terms of use. Suitability for commercial use depends on the underlying Gemma-2 model license.
Limitations & Caveats
The `speed` preset is CUDA-only. The `memory_extreme` preset utilizes CPU offloading via Accelerate, which may introduce latency. Users must accept the Gemma-2 model terms on Hugging Face.
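For context on what CPU offloading involves, here is a hedged sketch using the Accelerate-backed `device_map` handling in 🤗 Transformers; the memory caps and model choice are illustrative assumptions, not the project's `memory_extreme` defaults.

```python
# Sketch of Accelerate-style CPU offloading via device_map; the memory caps
# below are illustrative assumptions, not the project's `memory_extreme` defaults.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-27b-it",                  # illustrative model choice
    torch_dtype=torch.bfloat16,
    device_map="auto",                        # Accelerate places layers on GPU, then CPU
    max_memory={0: "8GiB", "cpu": "48GiB"},   # cap GPU use; overflow lives on CPU
    offload_folder="offload",                 # spill to disk if CPU memory is exhausted
)
```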