Native-LLM-for-Android by DakeQQ

Native LLM inference for Android devices

Created 2 years ago

256 stars

Top 98.5% on SourcePulse

Project Summary

This project demonstrates running Large Language Models (LLMs) natively on Android devices, providing on-device inference with no cloud dependency. It targets developers and power users who want to integrate LLMs into mobile applications, and it ships optimized builds of a range of popular models.

How It Works

Models are taken from HuggingFace or ModelScope, exported to ONNX, and optimized for execution speed on mobile hardware. The README recommends exporting with dynamic axes and quantizing to q4f32. Tokenizer files come from the mnn-llm repository. The project also supports other quantization methods and documents the parameter adjustments that some models need, as well as a low-memory loading mode.
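To make the quantization step more concrete, here is a minimal sketch of block-wise 4-bit weight quantization in plain Python. It assumes that q4f32 means 4-bit stored weights with float32 arithmetic, which the summary does not spell out. The min/scale scheme and the function names `quantize_q4_block` and `dequantize_q4_block` are illustrative choices, not the project's or ONNX Runtime's actual implementation.

```python
from typing import List, Tuple


def quantize_q4_block(weights: List[float]) -> Tuple[List[int], float, float]:
    """Quantize one block of float weights to 4-bit codes (0..15).

    Illustrative asymmetric scheme: each block stores its minimum and a
    scale, and every weight becomes an integer code in [0, 15].
    """
    w_min, w_max = min(weights), max(weights)
    # A constant block would give a zero scale; fall back to 1.0 in that case.
    scale = (w_max - w_min) / 15.0 or 1.0
    codes = [min(15, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def dequantize_q4_block(codes: List[int], scale: float, w_min: float) -> List[float]:
    """Recover approximate float weights from 4-bit codes."""
    return [c * scale + w_min for c in codes]


# Hypothetical block of weights, used only to exercise the functions.
block = [-1.0, -0.5, 0.0, 0.5, 1.0, 0.25]
codes, scale, w_min = quantize_q4_block(block)
restored = dequantize_q4_block(codes, scale, w_min)
max_error = max(abs(a - b) for a, b in zip(block, restored))
```

With this scheme the rounding error per weight is bounded by half the block's scale, which is one reason converted models can produce slightly different output from the originals.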

Quick Start & Requirements

Download desired models.
Place model files into the assets folder.
Decompress the *.so files in the libs/arm64-v8a folder.
For specific models like Qwen2VL/Qwen2.5VL, adjust key variables in GLRender.java and project.h.
To enable low memory mode, set low_memory_mode = true in MainActivity.java.
Model conversion and optimization use the Python scripts in the Export_ONNX folder together with the onnxruntime.tools.convert_onnx_models_to_ort tool.
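The ORT conversion tool named in the last step can be run as a Python module. The sketch below builds that command and runs it in a subprocess. The helper names and the example path `model.onnx` are hypothetical, only the positional model path is passed, and it assumes the onnxruntime package is installed in the active Python environment.

```python
import subprocess
import sys
from typing import List


def build_ort_convert_command(model_path: str) -> List[str]:
    """Build the command that converts an ONNX model (or a folder of models) to ORT format."""
    return [
        sys.executable,
        "-m",
        "onnxruntime.tools.convert_onnx_models_to_ort",
        model_path,
    ]


def convert_to_ort(model_path: str) -> None:
    """Run the conversion; raises CalledProcessError if the tool exits with an error."""
    subprocess.run(build_ort_convert_command(model_path), check=True)


# "model.onnx" is a placeholder path, not a file shipped with the project.
command = build_ort_convert_command("model.onnx")
```

Calling `convert_to_ort("model.onnx")` would then run the tool on that file. Building the command separately makes it easy to log or inspect before anything is executed.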

Highlighted Details

Model Support: Includes a wide array of models such as Qwen (0.6B-4B), Qwen-VL (2B-4B), Qwen2.5 (0.5B-3B), DeepSeek-R1-Distill-Qwen (1.5B), MiniCPM (1B-2.7B), Gemma-3-it (1B-4B), Phi-4-mini-Instruct (3.8B), Llama-3.2-Instruct (1B), InternVL-Mono (2B), InternLM-3 (8B), Seed-X (7B), and HunYuan (1.5B-7B).
Performance: Reported inference speeds include 37 tokens/s for Qwen3-1.7B (q4f32, dynamic) on a Vivo X200 Pro (MediaTek Dimensity 9400, CPU) and 78 tokens/s for MiniCPM4-0.5B (q4f32) on a Nubia Z50 (Snapdragon 8 Gen 2, CPU). Performance varies by device, backend, and model quantization.
Optimization: Models are explicitly optimized for "extreme execution speed" on Android.
Features: Supports a low-memory loading mode for resource-constrained devices.
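To put the throughput figures above in perspective, the following sketch converts tokens per second into the approximate time needed to generate a reply. The 256-token reply length is a hypothetical example, and the calculation assumes the reported figures describe steady generation speed, so it ignores prompt processing and model loading time.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Estimate how long it takes to generate num_tokens at a steady rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Rates taken from the performance figures above; 256 tokens is an assumed reply length.
qwen3_seconds = generation_seconds(256, 37.0)    # Qwen3-1.7B on the Vivo X200 Pro
minicpm4_seconds = generation_seconds(256, 78.0)  # MiniCPM4-0.5B on the Nubia Z50
```

Under these assumptions a reply of that length takes roughly seven seconds on the first configuration and a little over three seconds on the second.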

Maintenance & Community

The project shows recent activity with updates logged through early 2026, indicating ongoing development. No specific community links (e.g., Discord, Slack) or contributor details are provided in the README.

Licensing & Compatibility

License information is not specified in the provided README content.

Limitations & Caveats

Input and output behavior may differ slightly from the original HuggingFace or ModelScope models due to optimization and conversion processes. Specific parameter adjustments are required for certain model families (e.g., Qwen2VL/Qwen2.5VL).

Explore Similar Projects

Nano by bd4sur

picollm by Picovoice

Kolosal by KolosalAI

llama.cpp-deepseek-v4-flash by antirez

buun-llama-cpp by spiritbuun

InferLLM by MegEngine

Awesome-LLMs-on-device by NexaAI

mllm by UbiquitousLearning

distributed-llama by b4rtaz

llm-awq by mit-han-lab

mlx-lm by ml-explore

airllm by lyogavin