picoLLM by Picovoice

On-device LLM inference SDK for quantized models

Created 1 year ago
268 stars

Top 95.7% on SourcePulse

View on GitHub
Project Summary

picoLLM is an on-device inference engine for large language models (LLMs), designed for high accuracy and privacy. It targets developers and researchers who need to run LLMs locally across desktops, mobile devices, and web browsers; its key benefit is efficient X-Bit quantization, which cuts resource consumption.

How It Works

picoLLM utilizes a novel quantization algorithm called picoLLM Compression, which automatically optimizes bit allocation across LLM weights based on a task-specific cost function. This adaptive approach surpasses fixed-bit schemes like GPTQ, recovering significantly more accuracy (e.g., 91-100% MMLU score recovery at 2-4 bits for Llama-3-8b). Inference runs entirely locally, ensuring data privacy.
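
Picovoice has not published the internals of picoLLM Compression, but the core idea of cost-driven bit allocation can be illustrated with a toy sketch. Everything below (the per-group sensitivity scores, the greedy loop, the 2^-bits error model) is an assumption for illustration, not the actual algorithm:

```python
# Toy sketch of cost-driven bit allocation (illustrative only; NOT
# Picovoice's proprietary picoLLM Compression algorithm). Groups of
# weights whose quantization error hurts the task cost most get more bits.

def allocate_bits(sensitivity, budget_bits, min_bits=2, max_bits=8):
    """sensitivity[i]: assumed per-group impact of quantization error on the
    task-specific cost (e.g., estimated on a calibration set).
    budget_bits: total number of bits to distribute across all groups."""
    n = len(sensitivity)
    bits = [min_bits] * n
    remaining = budget_bits - min_bits * n

    def cost(i, b):
        # Assume quantization error roughly halves with each extra bit: ~2^-b.
        return sensitivity[i] * 2.0 ** (-b)

    while remaining > 0:
        # Spend the next bit where it reduces the total cost the most.
        gains = [cost(i, bits[i]) - cost(i, bits[i] + 1) if bits[i] < max_bits else -1.0
                 for i in range(n)]
        best = max(range(n), key=gains.__getitem__)
        if gains[best] <= 0:
            break
        bits[best] += 1
        remaining -= 1
    return bits

# Example: three weight groups, 12 bits total; the most sensitive group
# ends up with the widest bit width.
print(allocate_bits([5.0, 1.0, 0.2], budget_bits=12))  # -> [6, 4, 2]
```

This is the sense in which adaptive allocation can beat a fixed-bit scheme: a uniform 4-bit assignment would waste precision on insensitive weight groups while starving sensitive ones.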

Quick Start & Requirements

  • Install: pip3 install picollm (Python), dotnet add package PicoLLM (.NET), yarn add @picovoice/picollm-node (Node.js), etc. (see the Python example after this list).
  • Prerequisites: An AccessKey from Picovoice Console is required for authentication, though inference is free for open-weight models. Model files must be downloaded separately from the Picovoice Console.
  • Demos: Available for Python, .NET, Node.js, Android, iOS, Web, and C.
  • Docs: https://picovoice.ai/docs/picollm/
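
A minimal Python sketch following the SDK's documented quick-start flow; ${ACCESS_KEY} and ${MODEL_PATH} are placeholders for your Picovoice Console AccessKey and separately downloaded .pllm model file:

```python
import picollm

# AccessKey from Picovoice Console; model file downloaded separately.
pllm = picollm.create(
    access_key='${ACCESS_KEY}',
    model_path='${MODEL_PATH}')

try:
    res = pllm.generate(prompt='Explain on-device LLM inference in one sentence.')
    print(res.completion)
finally:
    # Release native resources when done.
    pllm.release()
```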

Highlighted Details

  • Supports a wide range of open-weight models including Gemma, Llama (2 & 3), Mistral, Mixtral, Phi-2, and Phi-3.
  • Cross-platform compatibility: Linux, macOS, Windows, Raspberry Pi (4 & 5), Android, iOS, and major web browsers.
  • Runs on both CPU and GPU.
  • Offers an interrupt() function for halting generation in progress (see the sketch after this list).
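
A hedged sketch of interrupting a long generation from another thread; pllm.interrupt() is named in the README, but the stream_callback parameter and the exact behavior on interruption are assumptions based on the SDK's streaming docs:

```python
import threading

import picollm

pllm = picollm.create(access_key='${ACCESS_KEY}', model_path='${MODEL_PATH}')

# Halt generation 2 seconds after it starts; generate() should then return
# with whatever tokens were produced up to that point.
timer = threading.Timer(2.0, pllm.interrupt)
timer.start()

res = pllm.generate(
    prompt='Write a long essay about edge computing.',
    stream_callback=lambda token: print(token, end='', flush=True))
timer.cancel()
pllm.release()
```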

Maintenance & Community

  • Actively maintained by Picovoice. Recent releases include performance improvements and support for new models like Phi-3.5.
  • No explicit community links (Discord/Slack) are provided in the README.

Licensing & Compatibility

  • The SDKs and inference engine are free for open-weight models.
  • An AccessKey is required, so licensing and authentication remain tied to Picovoice services even though inference itself runs offline.

Limitations & Caveats

  • An internet connection is required to validate the AccessKey, even though inference is offline.
  • Model files must be downloaded separately from the Picovoice Console.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

  • 5 stars in the last 30 days

Explore Similar Projects

torchchat by pytorch

  • PyTorch-native SDK for local LLM inference across diverse platforms
  • Top 0.1% on SourcePulse · 4k stars
  • Created 1 year ago · Updated 1 week ago
  • Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Gabriel Almeida (Cofounder of Langflow), and 2 more.

llm-awq by mit-han-lab

  • Weight quantization research paper for LLM compression/acceleration
  • Top 0.3% on SourcePulse · 3k stars
  • Created 2 years ago · Updated 2 months ago
  • Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

airllm by lyogavin

  • Inference optimization for LLMs on low-resource hardware
  • Top 0.1% on SourcePulse · 6k stars
  • Created 2 years ago · Updated 2 weeks ago