reverse-engineering-gemma-3n by antimatter15

Reverse engineering Google's edge-optimized language model for local inference

Created 6 months ago
252 stars

Top 99.6% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

This repository details the reverse-engineering efforts for Google's Gemma 3n, an "open" language model optimized for edge devices. It targets engineers and researchers seeking to understand and potentially replicate the model's novel memory-saving architectures, aiming to facilitate porting to popular inference frameworks like llama.cpp or Huggingface Transformers. The primary benefit is demystifying Google's proprietary implementation and enabling broader accessibility and modification.

How It Works

The project dissects Gemma 3n's LiteRT MediaPipe .task file, identified as a zip archive containing compiled TFLite model components. It leverages a tflite parsing library and large language models (Claude, Gemini) to interpret low-level opcodes and draft equivalent PyTorch code. Key architectural elements under investigation include tied embedding and LM head weights, a "per-layer embeddings" mechanism for significant RAM reduction during inference, and the use of LAuReL (Low Rank) blocks within transformer layers to decrease parameter count and computational cost.
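
Because the .task bundle is reportedly an ordinary zip archive, its contents can be enumerated with nothing beyond the Python standard library. A minimal sketch (the helper name `list_task_contents` and any file names are illustrative, not taken from the repository):

```python
import zipfile

def list_task_contents(path):
    """Return (member name, uncompressed size) pairs from a MediaPipe
    .task bundle. A .task file is a plain zip archive, so the stdlib
    zipfile module can enumerate the compiled TFLite components inside."""
    with zipfile.ZipFile(path) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]

# Hypothetical usage (the actual bundle ships with the LiteRT release):
# for name, size in list_task_contents("gemma-3n.task"):
#     print(f"{name}: {size} bytes")
```

Each extracted member can then be handed to a TFLite flatbuffer parser for the opcode-level analysis described above.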

Quick Start & Requirements

Highlighted Details

  • Per-Layer Embeddings: A core technique that halves inference memory requirements by loading token-specific embedding facets on-demand from flash memory, rather than keeping all parameters in RAM.
  • LAuReL (Low Rank): Implements a low-rank transformation within transformer layers, reducing parameter count and compute by approximately 16x compared to dense matrix multiplication.
  • Tied Weights: Identical weights are observed for the token embedding matrix and the language model head, consistent with standard weight tying to avoid storing two large matrices.
  • Multimodality: Includes components for vision processing, likely based on MobileNetV4, though audio weights are not yet released.
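
The LAuReL idea can be sketched as a learned low-rank residual update: instead of a dense d×d transform (d² parameters), the layer computes x + B(Ax) with A of shape r×d and B of shape d×r, costing only 2dr parameters. The sizes below (d=2048, r=64) are illustrative assumptions chosen so the arithmetic reproduces the ~16x reduction quoted above; they are not values confirmed from the model.

```python
import numpy as np

def laurel_block(x, A, B):
    """Low-rank residual update in the spirit of LAuReL: project the
    input down to rank r with A, back up to dimension d with B, and
    add the result to the input."""
    return x + B @ (A @ x)

d, r = 2048, 64                # illustrative sizes (assumed, not extracted)
dense_params = d * d           # a full d x d matrix
lowrank_params = 2 * d * r     # A (r x d) plus B (d x r)
reduction = dense_params / lowrank_params
print(reduction)               # d / (2r) = 16.0 for these sizes

rng = np.random.default_rng(0)
A = rng.standard_normal((r, d)) * 0.01
B = rng.standard_normal((d, r)) * 0.01
x = rng.standard_normal(d)
y = laurel_block(x, A, B)
assert y.shape == x.shape      # residual form preserves the hidden size
```

Per-layer embeddings follow a similar memory-saving logic: the per-token, per-layer vectors can remain on flash (e.g. behind an `np.memmap`) and be gathered row by row on demand, so only the rows for the tokens currently being processed ever occupy RAM.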

Maintenance & Community

This repository is a personal reverse-engineering project. No specific community channels (Discord, Slack), roadmap, or formal maintenance structure are detailed in the README. The author explicitly seeks community contributions to develop a runnable open-source implementation.

Licensing & Compatibility

The repository itself does not specify a license. Gemma 3n is described as "open" but distributed in a compiled .task format. Compatibility for commercial use or closed-source linking is not addressed.

Limitations & Caveats

This is an exploratory reverse-engineering effort, not a production-ready implementation. The provided code is largely drafted with LLM assistance and requires further development for execution. The vision components are less explored, and the audio capabilities are not yet available. The author acknowledges potential inaccuracies and encourages community collaboration for a complete, runnable port.

Health Check

  • Last Commit: 6 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab

  • 0.9% · 2k stars
  • Speculative decoding research paper for faster LLM inference
  • Created 2 years ago · Updated 1 week ago