TinyChatEngine by mit-han-lab

On-device LLM/VLM inference library for edge deployment

created 2 years ago · 878 stars · Top 41.9% on sourcepulse

Project Summary

TinyChatEngine is an on-device inference library for Large Language Models (LLMs) and Visual Language Models (VLMs), targeting developers and researchers building edge AI applications. It enables real-time, private LLM/VLM deployment on laptops, cars, and robots by implementing advanced model compression techniques like SmoothQuant and AWQ.

How It Works

TinyChatEngine leverages SmoothQuant and AWQ for LLM compression, reducing model size and computational requirements. SmoothQuant tackles activation quantization by migrating the difficulty from activations to weights, while AWQ protects salient weight channels by analyzing activation magnitudes. The core inference engine is a from-scratch C/C++ implementation designed for compatibility across x86, ARM (including Apple Silicon), and NVIDIA GPUs, eliminating external library dependencies for streamlined deployment.
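
To make the scale-migration idea concrete, here is a minimal NumPy sketch of the SmoothQuant transformation described above; the shapes, alpha value, and function names are illustrative assumptions, not TinyChatEngine's actual C/C++ implementation:

```python
# Minimal sketch of SmoothQuant-style scale migration (illustrative only).
# Per-input-channel scales move quantization difficulty from activations
# to weights while leaving the matmul result unchanged.
import numpy as np

def smooth_scales(X, W, alpha=0.5):
    """s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) per input channel j."""
    act_max = np.abs(X).max(axis=0)   # activation range per channel
    w_max = np.abs(W).max(axis=1)     # weight range per input channel
    return act_max**alpha / w_max**(1 - alpha)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4)) * np.array([1.0, 50.0, 2.0, 30.0])  # outlier channels
W = rng.normal(size=(4, 4))

s = smooth_scales(X, W)
X_smooth, W_smooth = X / s, W * s[:, None]
# Mathematically equivalent, but X_smooth has a much flatter dynamic range,
# so activation quantization loses far less accuracy:
assert np.allclose(X @ W, X_smooth @ W_smooth)
```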

Quick Start & Requirements

  • Install: Clone the repository (git clone --recursive) and install Python dependencies (pip install -r requirements.txt).
  • Prerequisites:
    • macOS: brew install boost llvm; Xcode (for the Metal compiler).
    • Windows (CPU): GCC compiler with MSYS2, pacman -S --needed base-devel mingw-w64-x86_64-toolchain make unzip git.
    • Windows (NVIDIA GPU - Experimental): CUDA Toolkit, Visual Studio with C/C++ support.
    • NVIDIA GPU: CUDA compute capability >= 6.1.
  • Setup: Download quantized models with python tools/download_model.py, then compile with make chat -j (the full flow is consolidated in the sketch after this list).
  • Docs: Model Zoo, VILA Demo
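
Under the prerequisites above, a first run looks roughly like the following shell session (macOS shown). The repository URL comes from the project header; the llm/ build directory, the download script's model-selection flags, and the ./chat binary follow the upstream README's layout and may differ by version:

```sh
# Clone with submodules and install Python dependencies
git clone --recursive https://github.com/mit-han-lab/TinyChatEngine
cd TinyChatEngine
pip install -r requirements.txt

# macOS prerequisites (see the platform notes above)
brew install boost llvm

# Fetch a quantized model, build the chat demo, and run it
# (download_model.py takes model/target flags documented in the Model Zoo)
cd llm
python tools/download_model.py
make chat -j
./chat
```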

Highlighted Details

  • Supports Llama-3, Llama-2, Code Llama, Mistral, VILA, LLaVA, and StarCoder models.
  • Offers multiple precision options: FP32, W4A16, W4A32, W4A8, and W8A8 (WxAy denotes x-bit weights with y-bit activations; see the sketch after this list).
  • Optimized weight reordering (QM_ARM, QM_x86, QM_CUDA) for specific architectures to minimize runtime overhead.
  • Awarded Best Paper at MLSys 2024 for AWQ and TinyChat.
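
For readers unfamiliar with the WxAy notation above, here is a minimal NumPy sketch of the "W4" half, group-wise symmetric INT4 weight quantization. The group size and function names are illustrative assumptions; TinyChatEngine's real kernels additionally pack and reorder these weights per target (QM_ARM, QM_x86, QM_CUDA):

```python
# Minimal sketch of group-wise symmetric INT4 weight quantization, i.e. the
# "W4" in W4A16/W4A32/W4A8 (illustrative; group size 32 is an assumption).
import numpy as np

def quantize_w4(w, group_size=32):
    """Map fp32 weights to int4 values in [-8, 7], one fp scale per group."""
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_w4(q, scale):
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.default_rng(1).normal(size=256).astype(np.float32)
q, scale = quantize_w4(w)
print("max abs error:", np.abs(w - dequantize_w4(q, scale)).max())
```

At 4 bits per weight plus one scale per group, this cuts weight storage roughly 8x versus FP32, which is what lets the listed models fit on edge devices.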

Maintenance & Community

  • Actively developed by MIT HAN Lab.
  • Recent updates include Llama-3 support and VLM extensions.
  • Related projects: TinyEngine, SmoothQuant, AWQ.

Licensing & Compatibility

  • The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is undetermined.

Limitations & Caveats

  • Windows NVIDIA GPU support is experimental.
  • Support for GPUs with compute capability < 6.1 is untested.
  • The license is not specified, which may impact commercial adoption.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 39 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (Founder of Ostris), and 1 more.

nunchaku by nunchaku-tech

Top 2.1% · 3k stars
High-performance 4-bit diffusion model inference engine
created 8 months ago · updated 17 hours ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Ying Sheng (Author of SGLang).

fastllm by ztxz16

Top 0.4% · 4k stars
High-performance C++ LLM inference library
created 2 years ago · updated 2 weeks ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 6 more.

AutoGPTQ by AutoGPTQ

Top 0.1% · 5k stars
LLM quantization package using GPTQ algorithm
created 2 years ago · updated 3 months ago