picolm by RightNow-AI

Ultra-lightweight LLM inference for embedded systems

Created 1 week ago

New!

750 stars

Top 46.3% on SourcePulse

View on GitHub
Project Summary

PicoLM addresses the challenge of running Large Language Models (LLMs) on extremely resource-constrained hardware, such as low-cost single-board computers with minimal RAM. It targets developers and hobbyists seeking fully offline, private AI capabilities, offering a significant benefit by enabling powerful LLM inference without cloud dependencies or expensive hardware.

How It Works

This project is a pure C11 LLM inference engine designed for minimal footprint. It leverages memory-mapping (mmap) to keep the model weights on disk, streaming only necessary layers into RAM. Combined with 4-bit quantization (Q4_K_M), an FP16 KV cache, and optimizations like Flash Attention and SIMD acceleration (ARM NEON, x86 SSE2), PicoLM achieves ~45MB RAM usage for a 1.1B parameter model. Its core advantage lies in enabling LLM inference on hardware previously considered incapable, with a single binary and zero external dependencies.

Quick Start & Requirements

  • Installation: A one-liner script (curl ... | bash) automates dependency installation, PicoLM build, model download, and PicoClaw configuration. Alternatively, build from source via git clone and make native.
  • Prerequisites: Linux/Pi requires gcc and make; macOS requires Xcode Command Line Tools; Windows requires Visual Studio Build Tools. A model file (e.g., TinyLlama 1.1B Q4_K_M, ~638MB) must be downloaded.
  • Resource Footprint: TinyLlama 1.1B requires ~45MB RAM; the binary is ~80KB.
  • Links: install.sh, Technical Blog.

Highlighted Details

  • Native support for LLaMA-architecture models in GGUF format.
  • Extensive quantization support (Q2_K to F32).
  • Memory-mapped layer streaming and FP16 KV cache minimize RAM footprint.
  • Optimizations include Flash Attention, pre-computed RoPE, fused operations, and SIMD (NEON/SSE2).
  • Grammar-constrained JSON output mode ensures valid structured data for tool calling.
  • KV cache persistence significantly speeds up repeated prompts.
  • Zero external dependencies beyond standard C libraries.

Maintenance & Community

The project includes a roadmap outlining future development directions such as AVX2/AVX-512 support and speculative decoding. No specific community channels (like Discord/Slack) or contributor details are provided in the README.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: The permissive MIT license allows for commercial use and integration into closed-source projects without significant restrictions.

Limitations & Caveats

PicoLM is strictly CPU-bound, lacking GPU acceleration. It supports only LLaMA-architecture GGUF models, and performance/quality is inherently limited by the chosen model size and the target hardware's capabilities. While versatile, its design is heavily influenced by its integration with the PicoClaw offline AI assistant.

Health Check

  • Last Commit: 3 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 5
  • Issues (30d): 8
  • Star History: 808 stars in the last 7 days

Explore Similar Projects

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 12 more.

mistral.rs by EricLBuehler

0.3%
7k
LLM inference engine for blazing fast performance
Created 2 years ago
Updated 6 days ago
Starred by Lianmin Zheng (Coauthor of SGLang, vLLM), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 1 more.

MiniCPM by OpenBMB

0.3%
9k
Ultra-efficient LLMs for end devices, achieving 5x+ speedup
Created 2 years ago
Updated 2 weeks ago