local-gemma by huggingface

CLI tool for local Gemma-2 inference

created 1 year ago
375 stars

Top 76.8% on sourcepulse

View on GitHub
Project Summary

This repository provides a streamlined method for running Google's Gemma-2 large language models locally on various hardware, targeting developers and researchers who need efficient on-device inference. It offers significant flexibility in balancing performance, memory usage, and accuracy through configurable presets.

How It Works

The project leverages 🤗 Transformers and bitsandbytes libraries to enable local execution of Gemma-2 models. It supports multiple hardware backends (CUDA, MPS, CPU) and offers optimization presets: exact for maximum accuracy, speed for enhanced throughput via torch.compile (CUDA only), memory for 4-bit quantization, and memory_extreme for minimal memory footprint with CPU offloading. This approach allows users to tailor the model's resource consumption and speed to their specific hardware and use case.
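
As an illustration of what the memory preset configures under the hood, the following is a minimal sketch using plain 🤗 Transformers and bitsandbytes 4-bit quantization. The checkpoint name is illustrative, and local-gemma's own Python API may wrap this differently:

```python
# Approximating the "memory" preset: 4-bit quantization via bitsandbytes.
# This sketches the general technique, not local-gemma's exact implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/gemma-2-9b-it"  # illustrative; any Gemma-2 checkpoint you can access

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # lets Accelerate place weights on GPU/CPU as capacity allows
)

inputs = tokenizer("Explain 4-bit quantization in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The memory_extreme preset goes further, offloading layers that do not fit in GPU memory to CPU RAM via Accelerate, trading generation speed for a smaller footprint.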

Quick Start & Requirements

  • Installation: pipx install local-gemma"[cuda]" for the CLI, or pip install local-gemma"[cuda]" for the Python API; substitute "[mps]" or "[cpu]" to match your backend.
  • Prerequisites: Python 3.x; CUDA 12+ is recommended for the speed preset; a Hugging Face read token is required to download the models (see the check sketched below this list).
  • Resources: Memory requirements vary by model size and preset, ranging from ~1.8GB (2b, memory_extreme) to 54.6GB (27b, exact).
  • Docs: Hugging Face Hub for model details.
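
Before installing, it can help to confirm which backend extra to use and that a Hugging Face token is in place. The snippet below is a minimal sketch, assuming torch and huggingface_hub are already installed; the token value is a placeholder:

```python
# Sanity-check prerequisites: pick a backend extra and verify the Hugging Face token.
import torch
from huggingface_hub import login

login(token="hf_xxx")  # placeholder; use your own read token or run `huggingface-cli login` once

if torch.cuda.is_available():
    backend = "cuda"   # install with local-gemma"[cuda]"
elif torch.backends.mps.is_available():
    backend = "mps"    # install with local-gemma"[mps]"
else:
    backend = "cpu"    # install with local-gemma"[cpu]"
print(f"Suggested backend extra: {backend}")
```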

Highlighted Details

  • speed preset offers up to 6x faster generation on CUDA via torch.compile (see the sketch after this list).
  • memory_extreme preset reduces memory usage significantly, enabling large models on consumer hardware (e.g., 3.7GB for the 9b model).
  • Supports interactive chat, single prompt execution, and piping command output as input.
  • Provides detailed benchmarks comparing presets on an A100 GPU.
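
For context on the speed preset, the sketch below shows the general torch.compile recipe for 🤗 Transformers models. It illustrates the technique rather than local-gemma's exact implementation, assumes a CUDA GPU, and uses an illustrative checkpoint:

```python
# Sketch of the "speed" preset technique: compiling the model's forward pass with
# torch.compile (CUDA only). This illustrates the approach, not local-gemma's exact code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-2b-it"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

# Compile the decoder forward pass; the first generations pay the compilation cost,
# subsequent ones run much faster.
model.forward = torch.compile(model.forward, mode="reduce-overhead")

inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```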

Maintenance & Community

The project builds on key libraries such as Transformers, bitsandbytes, Quanto, and Accelerate. Beyond those acknowledgements, no individual contributors are highlighted.

Licensing & Compatibility

The project itself appears to be under a permissive license, but it relies on Gemma-2 models, which have their own terms of use. Suitability for commercial use therefore depends on the underlying Gemma-2 model license.

Limitations & Caveats

The speed preset is CUDA-only. The memory_extreme preset relies on CPU offloading via Accelerate, which can add latency. Users must accept the Gemma-2 model terms on the Hugging Face Hub before the weights can be downloaded.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

9 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (Founder of Ostris), and 1 more.

nunchaku by nunchaku-tech

2.1%
3k
High-performance 4-bit diffusion model inference engine
created 9 months ago
updated 23 hours ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Ying Sheng (Author of SGLang).

fastllm by ztxz16

0.4%
4k
High-performance C++ LLM inference library
created 2 years ago
updated 2 weeks ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Jeff Hammerbacher (Cofounder of Cloudera), and 2 more.

gemma_pytorch by google

0.1%
6k
PyTorch implementation for Google's Gemma models
created 1 year ago
updated 2 months ago