reverse-engineering-gemma-3n by antimatter15

Reverse engineering Google's edge-optimized language model for local inference

Created 6 months ago
252 stars

Top 99.6% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

This repository details the reverse-engineering efforts for Google's Gemma 3n, an "open" language model optimized for edge devices. It targets engineers and researchers seeking to understand and potentially replicate the model's novel memory-saving architectures, aiming to facilitate porting to popular inference frameworks like llama.cpp or Huggingface Transformers. The primary benefit is demystifying Google's proprietary implementation and enabling broader accessibility and modification.

How It Works

The project dissects Gemma 3n's LiteRT MediaPipe .task file, identified as a zip archive containing compiled TFLite model components. It leverages a tflite parsing library and large language models (Claude, Gemini) to interpret low-level opcodes and draft equivalent PyTorch code. Key architectural elements under investigation include tied embedding and LM head weights, a "per-layer embeddings" mechanism for significant RAM reduction during inference, and the use of LAuReL (Low Rank) blocks within transformer layers to decrease parameter count and computational cost.
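
Because the .task bundle is reportedly an ordinary zip archive, its contents can be enumerated with nothing beyond the Python standard library. A minimal sketch (the helper name `list_task_contents` and any file names are illustrative, not taken from the repository):

```python
import zipfile

def list_task_contents(path):
    """Return (member name, uncompressed size) pairs from a MediaPipe
    .task bundle. A .task file is a plain zip archive, so the stdlib
    zipfile module can enumerate the compiled TFLite components inside."""
    with zipfile.ZipFile(path) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]

# Hypothetical usage (the actual bundle ships with the LiteRT release):
# for name, size in list_task_contents("gemma-3n.task"):
#     print(f"{name}: {size} bytes")
```

Each extracted member can then be handed to a TFLite flatbuffer parser for the opcode-level analysis described above.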

Quick Start & Requirements

Highlighted Details

  • Per-Layer Embeddings: A core technique that halves inference memory requirements by loading token-specific embedding facets on-demand from flash memory, rather than keeping all parameters in RAM.
  • LAuReL (Low Rank): Implements a low-rank transformation within transformer layers, reducing parameter count and compute by approximately 16x compared to dense matrix multiplication.
  • Tied Weights: Identical weights are observed for the token embedding matrix and the language model head, consistent with standard weight tying to avoid storing two large matrices.
  • Multimodality: Includes components for vision processing, likely based on MobileNetV4, though audio weights are not yet released.
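
The LAuReL idea can be sketched as a learned low-rank residual update: instead of a dense d×d transform (d² parameters), the layer computes x + B(Ax) with A of shape r×d and B of shape d×r, costing only 2dr parameters. The sizes below (d=2048, r=64) are illustrative assumptions chosen so the arithmetic reproduces the ~16x reduction quoted above; they are not values confirmed from the model.

```python
import numpy as np

def laurel_block(x, A, B):
    """Low-rank residual update in the spirit of LAuReL: project the
    input down to rank r with A, back up to dimension d with B, and
    add the result to the input."""
    return x + B @ (A @ x)

d, r = 2048, 64                # illustrative sizes (assumed, not extracted)
dense_params = d * d           # a full d x d matrix
lowrank_params = 2 * d * r     # A (r x d) plus B (d x r)
reduction = dense_params / lowrank_params
print(reduction)               # d / (2r) = 16.0 for these sizes

rng = np.random.default_rng(0)
A = rng.standard_normal((r, d)) * 0.01
B = rng.standard_normal((d, r)) * 0.01
x = rng.standard_normal(d)
y = laurel_block(x, A, B)
assert y.shape == x.shape      # residual form preserves the hidden size
```

Per-layer embeddings follow a similar memory-saving logic: the per-token, per-layer vectors can remain on flash (e.g. behind an `np.memmap`) and be gathered row by row on demand, so only the rows for the tokens currently being processed ever occupy RAM.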

Maintenance & Community

This repository is a personal reverse-engineering project. No specific community channels (Discord, Slack), roadmap, or formal maintenance structure are detailed in the README. The author explicitly seeks community contributions to develop a runnable open-source implementation.

Licensing & Compatibility

The repository itself does not specify a license. Gemma 3n is described as "open" but distributed in a compiled .task format. Compatibility for commercial use or closed-source linking is not addressed.

Limitations & Caveats

This is an exploratory reverse-engineering effort, not a production-ready implementation. The provided code is largely drafted with LLM assistance and requires further development for execution. The vision components are less explored, and the audio capabilities are not yet available. The author acknowledges potential inaccuracies and encourages community collaboration for a complete, runnable port.

Health Check

  • Last Commit: 6 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab

  • 0.9% · 2k stars
  • Speculative decoding research paper for faster LLM inference
  • Created 2 years ago · Updated 1 week ago