picolm by RightNow-AI

Ultra-lightweight LLM inference for embedded systems

Created 1 week ago

New!

750 stars

Top 46.3% on SourcePulse

View on GitHub
Project Summary

PicoLM addresses the challenge of running Large Language Models (LLMs) on extremely resource-constrained hardware, such as low-cost single-board computers with minimal RAM. It targets developers and hobbyists seeking fully offline, private AI capabilities, offering a significant benefit by enabling powerful LLM inference without cloud dependencies or expensive hardware.

How It Works

This project is a pure C11 LLM inference engine designed for minimal footprint. It leverages memory-mapping (mmap) to keep the model weights on disk, streaming only necessary layers into RAM. Combined with 4-bit quantization (Q4_K_M), an FP16 KV cache, and optimizations like Flash Attention and SIMD acceleration (ARM NEON, x86 SSE2), PicoLM achieves ~45MB RAM usage for a 1.1B parameter model. Its core advantage lies in enabling LLM inference on hardware previously considered incapable, with a single binary and zero external dependencies.

Quick Start & Requirements

  • Installation: A one-liner script (curl ... | bash) automates dependency installation, PicoLM build, model download, and PicoClaw configuration. Alternatively, build from source via git clone and make native.
  • Prerequisites: Linux/Pi requires gcc and make; macOS requires Xcode Command Line Tools; Windows requires Visual Studio Build Tools. A model file (e.g., TinyLlama 1.1B Q4_K_M, ~638MB) must be downloaded.
  • Resource Footprint: TinyLlama 1.1B requires ~45MB RAM; the binary is ~80KB.
  • Links: install.sh, Technical Blog.

Highlighted Details

  • Native support for LLaMA-architecture models in GGUF format.
  • Extensive quantization support (Q2_K to F32).
  • Memory-mapped layer streaming and FP16 KV cache minimize RAM footprint.
  • Optimizations include Flash Attention, pre-computed RoPE, fused operations, and SIMD (NEON/SSE2).
  • Grammar-constrained JSON output mode ensures valid structured data for tool calling.
  • KV cache persistence significantly speeds up repeated prompts.
  • Zero external dependencies beyond standard C libraries.

Maintenance & Community

The project includes a roadmap outlining future development directions such as AVX2/AVX-512 support and speculative decoding. No specific community channels (like Discord/Slack) or contributor details are provided in the README.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: The permissive MIT license allows for commercial use and integration into closed-source projects without significant restrictions.

Limitations & Caveats

PicoLM is strictly CPU-bound, lacking GPU acceleration. It supports only LLaMA-architecture GGUF models, and performance/quality is inherently limited by the chosen model size and the target hardware's capabilities. While versatile, its design is heavily influenced by its integration with the PicoClaw offline AI assistant.

Health Check

  • Last Commit: 3 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 5
  • Issues (30d): 8
  • Star History: 808 stars in the last 7 days

Explore Similar Projects

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 12 more.

mistral.rs by EricLBuehler

0.3%
7k
LLM inference engine for blazing fast performance
Created 2 years ago
Updated 6 days ago
Starred by Lianmin Zheng (Coauthor of SGLang, vLLM), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 1 more.

MiniCPM by OpenBMB

0.3%
9k
Ultra-efficient LLMs for end devices, achieving 5x+ speedup
Created 2 years ago
Updated 2 weeks ago