On-device LLM/VLM inference library for edge deployment
TinyChatEngine is an on-device inference library for Large Language Models (LLMs) and Visual Language Models (VLMs), targeting developers and researchers building edge AI applications. It enables real-time, private LLM/VLM deployment on laptops, cars, and robots by implementing advanced model compression techniques like SmoothQuant and AWQ.
How It Works
TinyChatEngine leverages SmoothQuant and AWQ for LLM compression, reducing model size and computational requirements. SmoothQuant addresses quantization difficulty by migrating it from activations to weights, while AWQ protects salient weight channels by analyzing activation magnitudes. The core inference engine is a from-scratch C/C++ implementation designed for universal compatibility across x86, ARM (including Apple Silicon), and NVIDIA GPUs, eliminating external library dependencies for a streamlined deployment.
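To make the SmoothQuant idea concrete, here is a minimal NumPy sketch (not TinyChatEngine's actual C/C++ implementation): activations are divided by a per-channel smoothing scale and the weights absorb the same scale, so the layer output is unchanged while activation outliers shrink. The helper name and the `alpha=0.5` migration strength are illustrative assumptions.

```python
# Illustrative sketch of SmoothQuant-style scale migration, not the library's code.
import numpy as np

def smoothquant_scales(X, W, alpha=0.5):
    """Per-input-channel smoothing scales (hypothetical helper).

    X: activations, shape (tokens, in_features)
    W: weights, shape (in_features, out_features)
    """
    act_max = np.abs(X).max(axis=0)   # activation range per input channel
    w_max = np.abs(W).max(axis=1)     # weight range per input channel
    s = act_max ** alpha / np.maximum(w_max, 1e-8) ** (1.0 - alpha)
    return np.maximum(s, 1e-8)

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 8)) * np.array([1.0, 1, 1, 20, 1, 1, 1, 1])  # one outlier channel
W = rng.normal(size=(8, 4))

s = smoothquant_scales(X, W)
X_smooth = X / s                # activations become easier to quantize
W_smooth = W * s[:, None]       # quantization difficulty migrates into the weights
assert np.allclose(X @ W, X_smooth @ W_smooth)  # layer output is mathematically unchanged
```

AWQ applies a related observation at quantization time: channels with large activation magnitudes are treated as salient and protected by scaling, rather than quantized uniformly with the rest of the weights.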
Quick Start & Requirements
- Clone the repository with its submodules (`git clone --recursive`) and install the Python dependencies (`pip install -r requirements.txt`).
- macOS: `brew install boost llvm`, plus Xcode for the Metal compiler.
- Windows: set up MSYS2 and install the toolchain with `pacman -S --needed base-devel mingw-w64-x86_64-toolchain make unzip git`.
- Download a quantized model with `python tools/download_model.py`.
- Compile with `make chat -j` (the full sequence is sketched below).
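As a rough end-to-end sketch, the steps above can be collected as follows; the repository URL placeholder and the exact `download_model.py` options are left to the upstream README rather than assumed here.

```bash
# Illustrative sequence only; consult the upstream README for exact model names and flags.
git clone --recursive <TinyChatEngine repository URL>
cd TinyChatEngine
pip install -r requirements.txt
python tools/download_model.py   # choose a quantized model per the upstream instructions
make chat -j
./chat                           # launch the interactive chat demo (binary name per the make target)
```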
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats