InferLLM by MegEngine

Lightweight LLM inference framework

created 2 years ago · 732 stars · Top 48.2% on sourcepulse

Project Summary

InferLLM is a lightweight LLM inference framework designed for efficient local deployment of quantized models. It targets developers and researchers seeking a simpler, more modular alternative to projects like llama.cpp, offering improved readability and maintainability while retaining high performance across various architectures.

How It Works

InferLLM adopts a decoupled architecture, separating the framework logic from low-level kernels. This design choice enhances code clarity and facilitates easier modification compared to monolithic projects. It ports many of llama.cpp's optimized kernels and introduces a dedicated KV storage type for efficient caching and management, aiming for high inference speeds on both CPU and GPU.
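
To make the decoupling concrete, below is a minimal, self-contained sketch of the pattern described above: framework code programs against an abstract kernel interface while device backends (CPU, CUDA, ...) implement it, and a per-layer KV cache appends one key/value row per decoded token so past projections are never recomputed. All names here (Kernel, NaiveCpuKernel, KVCache) are illustrative assumptions, not InferLLM's actual API.

```cpp
#include <cstddef>
#include <cstdio>
#include <memory>
#include <vector>

// Device-agnostic kernel interface the framework layer programs against.
struct Kernel {
    virtual ~Kernel() = default;
    virtual void matmul(const float* a, const float* b, float* out,
                        size_t m, size_t n, size_t k) = 0;
};

// One concrete backend; a CUDA backend would implement the same interface.
struct NaiveCpuKernel : Kernel {
    void matmul(const float* a, const float* b, float* out,
                size_t m, size_t n, size_t k) override {
        for (size_t i = 0; i < m; ++i)
            for (size_t j = 0; j < n; ++j) {
                float acc = 0.f;
                for (size_t p = 0; p < k; ++p)
                    acc += a[i * k + p] * b[p * n + j];
                out[i * n + j] = acc;
            }
    }
};

// Minimal per-layer KV cache: each decode step appends one key/value row,
// so attention over past tokens reuses earlier projections.
struct KVCache {
    size_t head_dim;
    std::vector<float> keys, values;  // flattened [n_tokens, head_dim]
    explicit KVCache(size_t d) : head_dim(d) {}
    void append(const float* k, const float* v) {
        keys.insert(keys.end(), k, k + head_dim);
        values.insert(values.end(), v, v + head_dim);
    }
    size_t tokens() const { return keys.size() / head_dim; }
};

int main() {
    std::unique_ptr<Kernel> kernel = std::make_unique<NaiveCpuKernel>();
    KVCache cache(4);
    float k[4] = {1, 2, 3, 4}, v[4] = {5, 6, 7, 8};
    cache.append(k, v);
    // Score the newest query against the cached keys: [1 x d] * [d x 1].
    // (A real implementation would lay cached keys out transposed.)
    float q[4] = {1, 0, 0, 0}, score = 0.f;
    kernel->matmul(q, cache.keys.data(), &score, 1, 1, 4);
    std::printf("tokens cached: %zu, score: %.1f\n", cache.tokens(), score);
}
```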

Quick Start & Requirements

  • Install: Compile from source using CMake. Enable GPU support with cmake -DENABLE_GPU=ON ...
  • Prerequisites: CUDA Toolkit (if enabling GPU), NDK (for Android cross-compilation).
  • Models: Compatible with llama.cpp-format models, downloadable from Hugging Face (a header sanity check is sketched after this list).
  • Docs: ChatGLM model documentation
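
Since weights are fetched in llama.cpp's on-disk format, a quick check of the file header can catch a wrong or corrupted download before loading. The sketch below reads the leading 4-byte magic; the specific magic values (0x67676d6c for "ggml", 0x67676a74 for the later "ggjt" revision) and the little-endian read are assumptions drawn from llama.cpp conventions of that era, not from InferLLM's documentation.

```cpp
#include <cstdint>
#include <cstdio>

int main(int argc, char** argv) {
    if (argc < 2) {
        std::fprintf(stderr, "usage: %s <model-file>\n", argv[0]);
        return 1;
    }
    std::FILE* f = std::fopen(argv[1], "rb");
    if (!f) {
        std::perror("fopen");
        return 1;
    }
    // Read the leading 4-byte magic; assumes a little-endian host,
    // matching how llama.cpp-era files are written.
    std::uint32_t magic = 0;
    if (std::fread(&magic, sizeof(magic), 1, f) != 1) {
        std::fprintf(stderr, "file too short\n");
        std::fclose(f);
        return 1;
    }
    std::fclose(f);
    // 0x67676d6c = "ggml" (original), 0x67676a74 = "ggjt" (later revision).
    if (magic == 0x67676d6cu || magic == 0x67676a74u)
        std::printf("looks like a llama.cpp-format model (magic 0x%08x)\n",
                    (unsigned)magic);
    else
        std::printf("unrecognized magic 0x%08x\n", (unsigned)magic);
    return 0;
}
```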

Highlighted Details

  • Supports multiple model families, including Alpaca, Llama, Llama-2, ChatGLM/ChatGLM2, and Baichuan.
  • Optimized for Arm, x86, CUDA, and RISC-V vector architectures.
  • Achieves acceptable inference speeds on mobile devices.
  • Performance optimizations include int4 matmul kernels with ARM assembly and kernel packing (see the quantization sketch after this list).
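
The int4 kernels rest on block-wise quantization: weights are grouped into fixed-size blocks, each stored as packed 4-bit integers plus one float scale, and the matmul inner loop unpacks and accumulates them. Below is a simplified illustration in the spirit of ggml's Q4_0 layout (32 weights per block, two values per byte); exact on-disk layouts differ, and this is a scalar sketch of the idea, not InferLLM's optimized ARM-assembly code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>

constexpr int kBlock = 32;

struct BlockQ4 {
    float scale;                      // per-block dequantization scale
    std::uint8_t packed[kBlock / 2];  // two 4-bit values per byte
};

// Quantize one block of 32 floats to 4-bit integers in [-8, 7].
BlockQ4 quantize_block(const float* x) {
    float amax = 0.f;
    for (int i = 0; i < kBlock; ++i) amax = std::max(amax, std::fabs(x[i]));
    BlockQ4 b{};
    b.scale = amax / 7.f;
    float inv = b.scale ? 1.f / b.scale : 0.f;
    for (int i = 0; i < kBlock; i += 2) {
        int lo = std::clamp((int)std::lround(x[i] * inv), -8, 7);
        int hi = std::clamp((int)std::lround(x[i + 1] * inv), -8, 7);
        // Store with a +8 offset so each value fits an unsigned nibble.
        b.packed[i / 2] = (std::uint8_t)((lo + 8) | ((hi + 8) << 4));
    }
    return b;
}

// Dot product of one quantized block against 32 float activations; an
// optimized kernel would vectorize this with NEON/assembly over many blocks.
float dot_block(const BlockQ4& b, const float* y) {
    float acc = 0.f;
    for (int i = 0; i < kBlock; i += 2) {
        int lo = (b.packed[i / 2] & 0x0F) - 8;
        int hi = (b.packed[i / 2] >> 4) - 8;
        acc += lo * y[i] + hi * y[i + 1];
    }
    return acc * b.scale;
}

int main() {
    float w[kBlock], a[kBlock], ref = 0.f;
    for (int i = 0; i < kBlock; ++i) { w[i] = 0.1f * (i - 16); a[i] = 1.f; }
    for (int i = 0; i < kBlock; ++i) ref += w[i] * a[i];
    BlockQ4 q = quantize_block(w);
    std::printf("int4 dot: %.3f (fp32 reference: %.3f)\n", dot_block(q, a), ref);
}
```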

Maintenance & Community

Recent updates added Llama-2 support and performance optimizations; however, the last commit was about a year ago (see Health Check below). Specific community channels and contributor details are not prominently featured in the README.

Licensing & Compatibility

Licensed under the Apache License, Version 2.0. This permissive license allows for commercial use and integration into closed-source projects.

Limitations & Caveats

Currently, only CUDA is supported for GPU acceleration. The project primarily focuses on int4 quantized models, and while it supports various architectures, performance tuning might be required for specific hardware.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
Star History
6 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Nat Friedman (former CEO of GitHub), and 32 more.

llama.cpp by ggml-org

Top 0.4% · 84k stars
C/C++ library for local LLM inference
created 2 years ago · updated 15 hours ago