TinyChatEngine by mit-han-lab

On-device LLM/VLM inference library for edge deployment

created 2 years ago · 878 stars · Top 41.9% on sourcepulse

Project Summary

TinyChatEngine is an on-device inference library for Large Language Models (LLMs) and Visual Language Models (VLMs), targeting developers and researchers building edge AI applications. It enables real-time, private LLM/VLM deployment on laptops, cars, and robots by implementing advanced model compression techniques like SmoothQuant and AWQ.

How It Works

TinyChatEngine leverages SmoothQuant and AWQ for LLM compression, reducing model size and computational requirements. SmoothQuant tackles activation quantization by migrating the difficulty from activations to weights, while AWQ protects salient weight channels by analyzing activation magnitudes. The core inference engine is a from-scratch C/C++ implementation designed for compatibility across x86, ARM (including Apple Silicon), and NVIDIA GPUs, eliminating external library dependencies for streamlined deployment.
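
To make the scale-migration idea concrete, here is a minimal NumPy sketch of the SmoothQuant transformation described above; the shapes, alpha value, and function names are illustrative assumptions, not TinyChatEngine's actual C/C++ implementation:

```python
# Minimal sketch of SmoothQuant-style scale migration (illustrative only).
# Per-input-channel scales move quantization difficulty from activations
# to weights while leaving the matmul result unchanged.
import numpy as np

def smooth_scales(X, W, alpha=0.5):
    """s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) per input channel j."""
    act_max = np.abs(X).max(axis=0)   # activation range per channel
    w_max = np.abs(W).max(axis=1)     # weight range per input channel
    return act_max**alpha / w_max**(1 - alpha)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4)) * np.array([1.0, 50.0, 2.0, 30.0])  # outlier channels
W = rng.normal(size=(4, 4))

s = smooth_scales(X, W)
X_smooth, W_smooth = X / s, W * s[:, None]
# Mathematically equivalent, but X_smooth has a much flatter dynamic range,
# so activation quantization loses far less accuracy:
assert np.allclose(X @ W, X_smooth @ W_smooth)
```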

Quick Start & Requirements

  • Install: Clone the repository (git clone --recursive) and install Python dependencies (pip install -r requirements.txt).
  • Prerequisites:
    • macOS: brew install boost llvm; Xcode (for the Metal compiler).
    • Windows (CPU): GCC compiler with MSYS2, pacman -S --needed base-devel mingw-w64-x86_64-toolchain make unzip git.
    • Windows (NVIDIA GPU - Experimental): CUDA Toolkit, Visual Studio with C/C++ support.
    • NVIDIA GPU: CUDA compute capability >= 6.1.
  • Setup: Download quantized models with python tools/download_model.py, then compile with make chat -j (the full flow is consolidated in the sketch after this list).
  • Docs: Model Zoo, VILA Demo
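
Under the prerequisites above, a first run looks roughly like the following shell session (macOS shown). The repository URL comes from the project header; the llm/ build directory, the download script's model-selection flags, and the ./chat binary follow the upstream README's layout and may differ by version:

```sh
# Clone with submodules and install Python dependencies
git clone --recursive https://github.com/mit-han-lab/TinyChatEngine
cd TinyChatEngine
pip install -r requirements.txt

# macOS prerequisites (see the platform notes above)
brew install boost llvm

# Fetch a quantized model, build the chat demo, and run it
# (download_model.py takes model/target flags documented in the Model Zoo)
cd llm
python tools/download_model.py
make chat -j
./chat
```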

Highlighted Details

  • Supports Llama-3, Llama-2, Code Llama, Mistral, VILA, LLaVA, and StarCoder models.
  • Offers multiple precision options: FP32, W4A16, W4A32, W4A8, and W8A8 (WxAy denotes x-bit weights with y-bit activations; see the sketch after this list).
  • Optimized weight reordering (QM_ARM, QM_x86, QM_CUDA) for specific architectures to minimize runtime overhead.
  • Awarded Best Paper at MLSys 2024 for AWQ and TinyChat.
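
For readers unfamiliar with the WxAy notation above, here is a minimal NumPy sketch of the "W4" half, group-wise symmetric INT4 weight quantization. The group size and function names are illustrative assumptions; TinyChatEngine's real kernels additionally pack and reorder these weights per target (QM_ARM, QM_x86, QM_CUDA):

```python
# Minimal sketch of group-wise symmetric INT4 weight quantization, i.e. the
# "W4" in W4A16/W4A32/W4A8 (illustrative; group size 32 is an assumption).
import numpy as np

def quantize_w4(w, group_size=32):
    """Map fp32 weights to int4 values in [-8, 7], one fp scale per group."""
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_w4(q, scale):
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.default_rng(1).normal(size=256).astype(np.float32)
q, scale = quantize_w4(w)
print("max abs error:", np.abs(w - dequantize_w4(q, scale)).max())
```

At 4 bits per weight plus one scale per group, this cuts weight storage roughly 8x versus FP32, which is what lets the listed models fit on edge devices.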

Maintenance & Community

  • Actively developed by MIT HAN Lab.
  • Recent updates include Llama-3 support and VLM extensions.
  • Related projects: TinyEngine, SmoothQuant, AWQ.

Licensing & Compatibility

  • The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is undetermined.

Limitations & Caveats

  • Windows NVIDIA GPU support is experimental.
  • Support for GPUs with compute capability < 6.1 is untested.
  • The license is not specified, which may impact commercial adoption.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 39 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (Founder of Ostris), and 1 more.

nunchaku by nunchaku-tech

Top 2.1% · 3k stars
High-performance 4-bit diffusion model inference engine
created 8 months ago · updated 17 hours ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Ying Sheng (Author of SGLang).

fastllm by ztxz16

Top 0.4% · 4k stars
High-performance C++ LLM inference library
created 2 years ago · updated 2 weeks ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 6 more.

AutoGPTQ by AutoGPTQ

Top 0.1% · 5k stars
LLM quantization package using GPTQ algorithm
created 2 years ago · updated 3 months ago