local-gemma by huggingface

CLI tool for local Gemma-2 inference

Created 1 year ago
376 stars

Top 75.5% on SourcePulse

View on GitHub
Project Summary

This repository provides a streamlined method for running Google's Gemma-2 large language models locally on various hardware, targeting developers and researchers who need efficient on-device inference. It offers significant flexibility in balancing performance, memory usage, and accuracy through configurable presets.

How It Works

The project leverages 🤗 Transformers and bitsandbytes libraries to enable local execution of Gemma-2 models. It supports multiple hardware backends (CUDA, MPS, CPU) and offers optimization presets: exact for maximum accuracy, speed for enhanced throughput via torch.compile (CUDA only), memory for 4-bit quantization, and memory_extreme for minimal memory footprint with CPU offloading. This approach allows users to tailor the model's resource consumption and speed to their specific hardware and use case.
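
As a rough sketch of how these presets surface in the Python API (assuming, per the project README, a LocalGemma2ForCausalLM class that accepts a preset argument; exact names may differ):

    from transformers import AutoTokenizer
    from local_gemma import LocalGemma2ForCausalLM  # assumed import path per the README

    # Pick a preset to trade accuracy, speed, and memory against each other:
    # "exact", "speed" (CUDA only), "memory" (4-bit), or "memory_extreme" (CPU offload).
    model = LocalGemma2ForCausalLM.from_pretrained(
        "google/gemma-2-9b-it",
        preset="memory",
    )
    tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")

    inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))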

Quick Start & Requirements

  • Installation: pipx install local-gemma"[cuda]" (or "[mps]" / "[cpu]") for the CLI, or pip install local-gemma"[cuda]" (or "[mps]" / "[cpu]") for the Python API.
  • Prerequisites: Python 3.x. CUDA 12+ recommended for the speed preset. A Hugging Face read token is required for model download (see the snippet after this list).
  • Resources: Memory requirements vary by model size and preset, ranging from ~1.8GB (2b, memory_extreme) to 54.6GB (27b, exact).
  • Docs: Hugging Face Hub for model details.
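
Because the Gemma-2 checkpoints are gated, the read token has to be in place before the first download. One way to supply it from Python, using the standard huggingface_hub helper (the token string below is a placeholder):

    from huggingface_hub import login

    # Authenticate once with a read token from https://huggingface.co/settings/tokens;
    # the token is cached locally, so later model downloads reuse it automatically.
    login(token="hf_...")  # placeholder -- substitute your own read token

The same can be done from a terminal with huggingface-cli login.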

Highlighted Details

  • speed preset offers up to 6x faster generation on CUDA via torch.compile.
  • memory_extreme preset reduces memory usage significantly, enabling large models on consumer hardware (e.g., 3.7GB for the 9b model).
  • Supports interactive chat, single-prompt execution, and piping command output as input (a Python sketch of chat-style use follows this list).
  • Provides detailed benchmarks comparing presets on an A100 GPU.
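
A chat-style exchange analogous to the CLI's interactive mode can be sketched through the Python API (again assuming the LocalGemma2ForCausalLM class noted above; the chat-template call itself is standard 🤗 Transformers):

    from transformers import AutoTokenizer
    from local_gemma import LocalGemma2ForCausalLM  # assumed import path per the README

    model = LocalGemma2ForCausalLM.from_pretrained("google/gemma-2-2b-it", preset="exact")
    tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")

    # Format a single-turn conversation with the model's chat template and generate a reply.
    messages = [{"role": "user", "content": "Summarise the Gemma-2 presets in one sentence."}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(input_ids, max_new_tokens=64)
    print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))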

Maintenance & Community

The project is built upon key libraries like Transformers, bitsandbytes, Quanto, and Accelerate. Specific contributors are not highlighted beyond acknowledgements.

Licensing & Compatibility

The project itself appears to be under a permissive license, but it relies on Gemma-2 models which have their own terms of use. Compatibility for commercial use depends on the underlying Gemma-2 model license.

Limitations & Caveats

The speed preset is CUDA-only. The memory_extreme preset utilizes CPU offloading via Accelerate, which may introduce latency. Users must accept Gemma-2 model terms on Hugging Face.
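
Given these constraints, a defensive pattern is to pick the preset from the detected backend rather than hard-coding it; a simple illustration (not part of the library):

    import torch

    # The "speed" preset relies on torch.compile and is CUDA-only; elsewhere,
    # fall back to another preset (the fallback choice here is illustrative).
    preset = "speed" if torch.cuda.is_available() else "exact"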

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 1 star in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).

fastllm by ztxz16

0.4%
4k
High-performance C++ LLM inference library
Created 2 years ago
Updated 1 week ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Tim J. Baek (Founder of Open WebUI), and 7 more.

gemma.cpp by google

0.1%
7k
C++ inference engine for Google's Gemma models
Created 1 year ago
Updated 1 day ago
Starred by Tobi Lutke (Cofounder of Shopify), Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), and 36 more.

unsloth by unslothai

0.6%
46k
Finetuning tool for LLMs, targeting speed and memory efficiency
Created 1 year ago
Updated 13 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Nat Friedman (Former CEO of GitHub), and 54 more.

llama.cpp by ggml-org

0.4%
87k
C/C++ library for local LLM inference
Created 2 years ago
Updated 13 hours ago