Native-LLM-for-Android by DakeQQ

Native LLM inference for Android devices

Created 2 years ago
252 stars

Top 99.6% on SourcePulse

Project Summary

This project demonstrates running native Large Language Models (LLMs) directly on Android devices, offering on-device AI capabilities without cloud dependency. It targets developers and power users seeking to integrate LLMs into mobile applications, providing optimized performance for a variety of popular models.

How It Works

Models are converted from HuggingFace or ModelScope checkpoints and optimized for fast execution on mobile hardware. Conversion typically goes through ONNX export, for which the project recommends dynamic axes and q4f32 quantization. Tokenizer files are sourced from the mnn-llm repository. The project supports several quantization methods and includes specific instructions for adjusting model parameters and for enabling a low-memory loading mode.
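To make the quantization step more concrete, here is a minimal sketch of symmetric block-wise 4-bit weight quantization with float32 scales. It assumes that q4f32 means 4-bit weights combined with float32 scales and compute; the block size of 32 and the function names are illustrative choices, not taken from the project's export scripts.

```python
# Illustrative only: symmetric block-wise 4-bit quantization with float32 scales.
# This is NOT the project's export code; block size and names are assumptions.
from typing import List, Tuple

BLOCK_SIZE = 32  # assumed block size, chosen for illustration


def quantize_q4(weights: List[float],
                block_size: int = BLOCK_SIZE) -> Tuple[List[int], List[float]]:
    """Map each block of floats to integer codes in [-8, 7] plus one scale."""
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        # An all-zero block gets scale 1.0 so dequantization stays well defined.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        for w in block:
            q = round(w / scale)
            codes.append(max(-8, min(7, q)))  # clamp to the signed 4-bit range
    return codes, scales


def dequantize_q4(codes: List[int], scales: List[float],
                  block_size: int = BLOCK_SIZE) -> List[float]:
    """Recover approximate float weights from the codes and per-block scales."""
    return [codes[i] * scales[i // block_size] for i in range(len(codes))]


if __name__ == "__main__":
    demo = [((-1) ** i) * i * 0.01 for i in range(64)]
    demo_codes, demo_scales = quantize_q4(demo)
    restored = dequantize_q4(demo_codes, demo_scales)
    worst = max(abs(a - b) for a, b in zip(demo, restored))
    print(f"blocks={len(demo_scales)} worst_error={worst:.5f}")
```

Each weight is stored in 4 bits and every block shares one float32 scale, so the rounding error per weight is bounded by half of that block's scale. Real exporters usually pack two codes per byte and may use asymmetric zero points, which this sketch leaves out.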

Quick Start & Requirements

  1. Download desired models.
  2. Place model files into the assets folder.
  3. Decompress *.so files from the libs/arm64-v8a folder.
  4. For specific models like Qwen2VL/Qwen2.5VL, adjust key variables in GLRender.java and project.h.
  5. To enable low memory mode, set low_memory_mode = true in MainActivity.java.
  6. To convert and optimize a model yourself, use the Python scripts in the Export_ONNX folder together with onnxruntime.tools.convert_onnx_models_to_ort.
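The ONNX Runtime converter named in the last step can be invoked as a Python module. The sketch below only assembles that command; the directory name exported_models is hypothetical, and any extra converter options the project may rely on are not shown because the summary does not list them.

```python
# Sketch: build the command line for ONNX Runtime's ONNX-to-ORT converter.
# The model directory name is a placeholder, not a path from the project.
import subprocess
import sys
from pathlib import Path
from typing import List

CONVERTER_MODULE = "onnxruntime.tools.convert_onnx_models_to_ort"


def build_ort_convert_command(model_path: str) -> List[str]:
    """Return the argv list that runs the converter on a file or directory."""
    return [sys.executable, "-m", CONVERTER_MODULE, str(Path(model_path))]


def convert(model_path: str) -> None:
    """Run the converter; requires the onnxruntime package to be installed."""
    subprocess.run(build_ort_convert_command(model_path), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(build_ort_convert_command("exported_models")))
```

Calling convert("exported_models") would run the tool on every .onnx file it finds under that path, provided onnxruntime is installed in the active environment.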

Highlighted Details

  • Model Support: Includes a wide array of models such as Qwen (0.6B-4B), Qwen-VL (2B-4B), Qwen2.5 (0.5B-3B), DeepSeek-R1-Distill-Qwen (1.5B), MiniCPM (1B-2.7B), Gemma-3-it (1B-4B), Phi-4-mini-Instruct (3.8B), Llama-3.2-Instruct (1B), InternVL-Mono (2B), InternLM-3 (8B), Seed-X (7B), and HunYuan (1.5B-7B).
  • Performance: Reported speeds include Qwen3-1.7B (q4f32, dynamic axes) at 37 tokens/s on a Vivo X200 Pro (MediaTek Dimensity 9400, CPU backend) and MiniCPM4-0.5B (q4f32) at 78 tokens/s on a Nubia Z50 (Snapdragon 8 Gen 2, CPU backend). Performance varies by device, backend, and model quantization.
  • Optimization: Models are explicitly optimized for "extreme execution speed" on Android.
  • Features: Supports a low-memory loading mode for resource-constrained devices.
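When deciding whether a listed model fits on a given phone, a back-of-the-envelope estimate of weight storage can help. The sketch below assumes 4-bit weights with one float32 scale per block of 32 weights; those parameters are illustrative, and the result ignores the KV cache, activations, and any layers kept at higher precision.

```python
# Rough estimate of weight storage for a block-quantized model.
# Bits per weight, block size, and scale width are assumptions for illustration.

def estimate_weight_bytes(num_params: float,
                          bits_per_weight: int = 4,
                          block_size: int = 32,
                          scale_bytes: int = 4) -> float:
    """Return approximate bytes for quantized weights plus per-block scales."""
    weight_bytes = num_params * bits_per_weight / 8
    scale_overhead = (num_params / block_size) * scale_bytes
    return weight_bytes + scale_overhead


if __name__ == "__main__":
    # Nominal parameter counts taken from the model names, not measured sizes.
    for name, params in [("Qwen3-1.7B", 1.7e9), ("MiniCPM4-0.5B", 0.5e9)]:
        gib = estimate_weight_bytes(params) / 2**30
        print(f"{name}: about {gib:.2f} GiB of weights")
```

Actual file sizes depend on how the exporter treats embeddings and output layers, so the estimate is best read as a lower bound on the memory the model needs.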

Maintenance & Community

The project shows recent activity with updates logged through early 2026, indicating ongoing development. No specific community links (e.g., Discord, Slack) or contributor details are provided in the README.

Licensing & Compatibility

License information is not specified in the provided README content.

Limitations & Caveats

Input and output behavior may differ slightly from the original HuggingFace or ModelScope models due to optimization and conversion processes. Specific parameter adjustments are required for certain model families (e.g., Qwen2VL/Qwen2.5VL).
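One way to quantify how far a converted model drifts from the original is to compare the logits both produce for the same prompt. The sketch below assumes the per-step logits have already been collected as plain lists of floats; the helper names are hypothetical and are not part of the project.

```python
# Sketch: compare per-step logits from a reference model and a converted model.
# How the logits are obtained is out of scope; plain lists of floats are assumed.
from typing import Sequence


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two logit vectors."""
    if len(a) != len(b) or not a:
        raise ValueError("logit vectors must be non-empty and of equal length")
    return max(abs(x - y) for x, y in zip(a, b))


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def top1_agreement(ref_steps: Sequence[Sequence[float]],
                   test_steps: Sequence[Sequence[float]]) -> float:
    """Fraction of decoding steps where both models pick the same token."""
    if len(ref_steps) != len(test_steps) or not ref_steps:
        raise ValueError("need the same non-zero number of steps from both models")
    matches = sum(1 for r, t in zip(ref_steps, test_steps)
                  if argmax(r) == argmax(t))
    return matches / len(ref_steps)
```

A small maximum difference with full top-1 agreement suggests the conversion preserved greedy decoding on that prompt, while disagreements point to the steps worth inspecting.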

Health Check

Last Commit: 2 days ago
Responsiveness: 1 day
Pull Requests (30d): 0
Issues (30d): 0

Star History

6 stars in the last 30 days
