local-gemma by huggingface

CLI tool for local Gemma-2 inference

Created 1 year ago
376 stars

Top 75.5% on SourcePulse

View on GitHub
Project Summary

This repository provides a streamlined method for running Google's Gemma-2 large language models locally on various hardware, targeting developers and researchers who need efficient on-device inference. It offers significant flexibility in balancing performance, memory usage, and accuracy through configurable presets.

How It Works

The project leverages 🤗 Transformers and bitsandbytes libraries to enable local execution of Gemma-2 models. It supports multiple hardware backends (CUDA, MPS, CPU) and offers optimization presets: exact for maximum accuracy, speed for enhanced throughput via torch.compile (CUDA only), memory for 4-bit quantization, and memory_extreme for minimal memory footprint with CPU offloading. This approach allows users to tailor the model's resource consumption and speed to their specific hardware and use case.
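
As a rough sketch of how these presets surface in the Python API (assuming, per the project README, a LocalGemma2ForCausalLM class that accepts a preset argument; exact names may differ):

    from transformers import AutoTokenizer
    from local_gemma import LocalGemma2ForCausalLM  # assumed import path per the README

    # Pick a preset to trade accuracy, speed, and memory against each other:
    # "exact", "speed" (CUDA only), "memory" (4-bit), or "memory_extreme" (CPU offload).
    model = LocalGemma2ForCausalLM.from_pretrained(
        "google/gemma-2-9b-it",
        preset="memory",
    )
    tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")

    inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))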

Quick Start & Requirements

  • Installation: pipx install local-gemma"[cuda]" (or "[mps]" / "[cpu]") for the CLI, or pip install local-gemma"[cuda]" (or "[mps]" / "[cpu]") for the Python API.
  • Prerequisites: Python 3.x. CUDA 12+ recommended for the speed preset. A Hugging Face read token is required for model download (see the snippet after this list).
  • Resources: Memory requirements vary by model size and preset, ranging from ~1.8GB (2b, memory_extreme) to 54.6GB (27b, exact).
  • Docs: Hugging Face Hub for model details.
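
Because the Gemma-2 checkpoints are gated, the read token has to be in place before the first download. One way to supply it from Python, using the standard huggingface_hub helper (the token string below is a placeholder):

    from huggingface_hub import login

    # Authenticate once with a read token from https://huggingface.co/settings/tokens;
    # the token is cached locally, so later model downloads reuse it automatically.
    login(token="hf_...")  # placeholder -- substitute your own read token

The same can be done from a terminal with huggingface-cli login.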

Highlighted Details

  • speed preset offers up to 6x faster generation on CUDA via torch.compile.
  • memory_extreme preset reduces memory usage significantly, enabling large models on consumer hardware (e.g., 3.7GB for the 9b model).
  • Supports interactive chat, single-prompt execution, and piping command output as input (a Python sketch of chat-style use follows this list).
  • Provides detailed benchmarks comparing presets on an A100 GPU.
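
A chat-style exchange analogous to the CLI's interactive mode can be sketched through the Python API (again assuming the LocalGemma2ForCausalLM class noted above; the chat-template call itself is standard 🤗 Transformers):

    from transformers import AutoTokenizer
    from local_gemma import LocalGemma2ForCausalLM  # assumed import path per the README

    model = LocalGemma2ForCausalLM.from_pretrained("google/gemma-2-2b-it", preset="exact")
    tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")

    # Format a single-turn conversation with the model's chat template and generate a reply.
    messages = [{"role": "user", "content": "Summarise the Gemma-2 presets in one sentence."}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(input_ids, max_new_tokens=64)
    print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))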

Maintenance & Community

The project is built upon key libraries like Transformers, bitsandbytes, Quanto, and Accelerate. Specific contributors are not highlighted beyond acknowledgements.

Licensing & Compatibility

The project itself appears to be under a permissive license, but it relies on Gemma-2 models which have their own terms of use. Compatibility for commercial use depends on the underlying Gemma-2 model license.

Limitations & Caveats

The speed preset is CUDA-only. The memory_extreme preset utilizes CPU offloading via Accelerate, which may introduce latency. Users must accept Gemma-2 model terms on Hugging Face.
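
Given these constraints, a defensive pattern is to pick the preset from the detected backend rather than hard-coding it; a simple illustration (not part of the library):

    import torch

    # The "speed" preset relies on torch.compile and is CUDA-only; elsewhere,
    # fall back to another preset (the fallback choice here is illustrative).
    preset = "speed" if torch.cuda.is_available() else "exact"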

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 1 star in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).

fastllm by ztxz16

0.4%
4k
High-performance C++ LLM inference library
Created 2 years ago
Updated 1 week ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Tim J. Baek (Founder of Open WebUI), and 7 more.

gemma.cpp by google

0.1%
7k
C++ inference engine for Google's Gemma models
Created 1 year ago
Updated 1 day ago
Starred by Tobi Lutke (Cofounder of Shopify), Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), and 36 more.

unsloth by unslothai

0.6%
46k
Finetuning tool for LLMs, targeting speed and memory efficiency
Created 1 year ago
Updated 13 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Nat Friedman (Former CEO of GitHub), and 54 more.

llama.cpp by ggml-org

0.4%
87k
C/C++ library for local LLM inference
Created 2 years ago
Updated 13 hours ago