TinyChatEngine by mit-han-lab

On-device LLM/VLM inference library for edge deployment

Created 2 years ago
894 stars

Top 40.5% on SourcePulse

1 Expert Loves This Project
Project Summary

TinyChatEngine is an on-device inference library for Large Language Models (LLMs) and Visual Language Models (VLMs), targeting developers and researchers building edge AI applications. It enables real-time, private LLM/VLM deployment on laptops, cars, and robots by implementing advanced model compression techniques like SmoothQuant and AWQ.

How It Works

TinyChatEngine leverages SmoothQuant and AWQ for LLM compression, reducing model size and computational requirements. SmoothQuant eases quantization by migrating the difficulty from activations to weights, while AWQ protects salient weight channels, identified by activation magnitudes, from quantization error. The core inference engine is a from-scratch C/C++ implementation designed for universal compatibility across x86, ARM (including Apple Silicon), and NVIDIA GPUs, eliminating external library dependencies for streamlined deployment.
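
The scale-migration idea is easy to illustrate. Below is a minimal NumPy sketch (not TinyChatEngine's actual code) of SmoothQuant-style smoothing, using the paper's per-channel scales s_j = max|X_j|^alpha / max|W_j|^(1-alpha) with alpha = 0.5, followed by naive W8A8 quantization:

    # Minimal sketch of SmoothQuant-style scale migration: per-channel
    # scales move quantization difficulty from activations to weights
    # while keeping X @ W mathematically unchanged.
    import numpy as np

    def smooth_scales(X, W, alpha=0.5):
        # X: [tokens, in_features], W: [in_features, out_features]
        act_max = np.abs(X).max(axis=0)   # per-input-channel activation range
        w_max = np.abs(W).max(axis=1)     # per-input-channel weight range
        return (act_max ** alpha) / (w_max ** (1 - alpha))

    def int8_quant(T):
        scale = np.abs(T).max() / 127.0
        return np.round(T / scale).astype(np.int8), scale

    rng = np.random.default_rng(0)
    X = rng.normal(size=(4, 8)); X[:, 3] *= 50.0   # channel 3 carries outliers
    W = rng.normal(size=(8, 2))

    s = smooth_scales(X, W)
    X_s, W_s = X / s, W * s[:, None]               # X_s @ W_s == X @ W exactly

    Xq, sx = int8_quant(X_s); Wq, sw = int8_quant(W_s)
    Y = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * (sx * sw)
    print(np.abs(Y - X @ W).max())                 # error vs. the FP32 result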

Quick Start & Requirements

  • Install: Clone the repository (git clone --recursive) and install Python dependencies (pip install -r requirements.txt).
  • Prerequisites:
    • macOS: brew install boost llvm; Xcode is required for the Metal compiler.
    • Windows (CPU): GCC compiler with MSYS2, pacman -S --needed base-devel mingw-w64-x86_64-toolchain make unzip git.
    • Windows (NVIDIA GPU - Experimental): CUDA Toolkit, Visual Studio with C/C++ support.
    • NVIDIA GPU: CUDA compute capability >= 6.1.
  • Setup: Download quantized models with python tools/download_model.py, then compile with make chat -j (a consolidated example follows this list).
  • Docs: Model Zoo, VILA Demo
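
Putting the steps together, a typical macOS setup might look like the following; the model name and --QM flag are illustrative, and the final binary name assumes the build target above (consult the Model Zoo for exact identifiers):

    git clone --recursive https://github.com/mit-han-lab/TinyChatEngine
    cd TinyChatEngine
    pip install -r requirements.txt
    brew install boost llvm                  # plus Xcode for the Metal compiler
    python tools/download_model.py --model LLaMA2_7B_chat_awq_int4 --QM QM_ARM   # illustrative flags
    make chat -j
    ./chat                                   # assumed name of the built binary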

Highlighted Details

  • Supports Llama-3, Llama-2, Code Llama, Mistral, VILA, LLaVA, and StarCoder models.
  • Supports multiple precisions: FP32 plus W4A16, W4A32, W4A8, and W8A8 quantization (x-bit weights, y-bit activations; see the sketch after this list).
  • Optimized weight reordering (QM_ARM, QM_x86, QM_CUDA) for specific architectures to minimize runtime overhead.
  • Awarded Best Paper at MLSys 2024 for AWQ and TinyChat.
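
In the WxAy naming, x is the weight bit-width and y the activation bit-width, so W4A8 means 4-bit weights with 8-bit activations. A hypothetical NumPy sketch of group-wise symmetric 4-bit weight quantization is below; the library's packed QM_* layouts are platform-specific and more involved:

    # Illustrative group-wise W4 quantization; one FP scale per group.
    import numpy as np

    def quant_w4(W, group=32):
        Wg = W.reshape(-1, group)
        scale = np.abs(Wg).max(axis=1, keepdims=True) / 7.0   # int4 range [-8, 7]
        q = np.clip(np.round(Wg / scale), -8, 7).astype(np.int8)
        return q, scale

    W = np.random.default_rng(0).normal(size=(64,))
    q, s = quant_w4(W)
    W_hat = (q * s).reshape(W.shape)        # dequantize
    print(np.abs(W - W_hat).max())          # per-group quantization error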

Maintenance & Community

  • Actively developed by MIT HAN Lab.
  • Recent updates include Llama-3 support and VLM extensions.
  • Related projects: TinyEngine, SmoothQuant, AWQ.

Licensing & Compatibility

  • The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is undetermined.

Limitations & Caveats

  • Windows NVIDIA GPU support is experimental.
  • Support for GPUs with compute capability < 6.1 is untested.
  • The license is not specified, which may impact commercial adoption.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 30 days

Explore Similar Projects

AQLM by Vahe1994
Top 0.4% on SourcePulse · 1k stars
PyTorch code for LLM compression via Additive Quantization (AQLM)
Created 1 year ago · Updated 1 month ago
Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 4 more.

llm-awq by mit-han-lab
Top 0.3% on SourcePulse · 3k stars
Weight quantization research paper for LLM compression/acceleration
Created 2 years ago · Updated 2 months ago
Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

airllm by lyogavin
Top 0.1% on SourcePulse · 6k stars
Inference optimization for LLMs on low-resource hardware
Created 2 years ago · Updated 2 weeks ago
Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").