InferLLM by MegEngine

Lightweight LLM inference framework

created 2 years ago · 732 stars · Top 48.2% on sourcepulse

Project Summary

InferLLM is a lightweight LLM inference framework designed for efficient local deployment of quantized models. It targets developers and researchers seeking a simpler, more modular alternative to projects like llama.cpp, offering improved readability and maintainability while retaining high performance across various architectures.

How It Works

InferLLM adopts a decoupled architecture, separating the framework logic from low-level kernels. This design choice enhances code clarity and facilitates easier modification compared to monolithic projects. It ports many of llama.cpp's optimized kernels and introduces a dedicated KV storage type for efficient caching and management, aiming for high inference speeds on both CPU and GPU.
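
To make the decoupling concrete, below is a minimal, self-contained sketch of the pattern described above: framework code programs against an abstract kernel interface while device backends (CPU, CUDA, ...) implement it, and a per-layer KV cache appends one key/value row per decoded token so past projections are never recomputed. All names here (Kernel, NaiveCpuKernel, KVCache) are illustrative assumptions, not InferLLM's actual API.

```cpp
#include <cstddef>
#include <cstdio>
#include <memory>
#include <vector>

// Device-agnostic kernel interface the framework layer programs against.
struct Kernel {
    virtual ~Kernel() = default;
    virtual void matmul(const float* a, const float* b, float* out,
                        size_t m, size_t n, size_t k) = 0;
};

// One concrete backend; a CUDA backend would implement the same interface.
struct NaiveCpuKernel : Kernel {
    void matmul(const float* a, const float* b, float* out,
                size_t m, size_t n, size_t k) override {
        for (size_t i = 0; i < m; ++i)
            for (size_t j = 0; j < n; ++j) {
                float acc = 0.f;
                for (size_t p = 0; p < k; ++p)
                    acc += a[i * k + p] * b[p * n + j];
                out[i * n + j] = acc;
            }
    }
};

// Minimal per-layer KV cache: each decode step appends one key/value row,
// so attention over past tokens reuses earlier projections.
struct KVCache {
    size_t head_dim;
    std::vector<float> keys, values;  // flattened [n_tokens, head_dim]
    explicit KVCache(size_t d) : head_dim(d) {}
    void append(const float* k, const float* v) {
        keys.insert(keys.end(), k, k + head_dim);
        values.insert(values.end(), v, v + head_dim);
    }
    size_t tokens() const { return keys.size() / head_dim; }
};

int main() {
    std::unique_ptr<Kernel> kernel = std::make_unique<NaiveCpuKernel>();
    KVCache cache(4);
    float k[4] = {1, 2, 3, 4}, v[4] = {5, 6, 7, 8};
    cache.append(k, v);
    // Score the newest query against the cached keys: [1 x d] * [d x 1].
    // (A real implementation would lay cached keys out transposed.)
    float q[4] = {1, 0, 0, 0}, score = 0.f;
    kernel->matmul(q, cache.keys.data(), &score, 1, 1, 4);
    std::printf("tokens cached: %zu, score: %.1f\n", cache.tokens(), score);
}
```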

Quick Start & Requirements

  • Install: Compile from source using CMake. Enable GPU support with cmake -DENABLE_GPU=ON ...
  • Prerequisites: CUDA Toolkit (if enabling GPU), NDK (for Android cross-compilation).
  • Models: Compatible with llama.cpp-format models, downloadable from Hugging Face (a header sanity check is sketched after this list).
  • Docs: ChatGLM model documentation
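
Since weights are fetched in llama.cpp's on-disk format, a quick check of the file header can catch a wrong or corrupted download before loading. The sketch below reads the leading 4-byte magic; the specific magic values (0x67676d6c for "ggml", 0x67676a74 for the later "ggjt" revision) and the little-endian read are assumptions drawn from llama.cpp conventions of that era, not from InferLLM's documentation.

```cpp
#include <cstdint>
#include <cstdio>

int main(int argc, char** argv) {
    if (argc < 2) {
        std::fprintf(stderr, "usage: %s <model-file>\n", argv[0]);
        return 1;
    }
    std::FILE* f = std::fopen(argv[1], "rb");
    if (!f) {
        std::perror("fopen");
        return 1;
    }
    // Read the leading 4-byte magic; assumes a little-endian host,
    // matching how llama.cpp-era files are written.
    std::uint32_t magic = 0;
    if (std::fread(&magic, sizeof(magic), 1, f) != 1) {
        std::fprintf(stderr, "file too short\n");
        std::fclose(f);
        return 1;
    }
    std::fclose(f);
    // 0x67676d6c = "ggml" (original), 0x67676a74 = "ggjt" (later revision).
    if (magic == 0x67676d6cu || magic == 0x67676a74u)
        std::printf("looks like a llama.cpp-format model (magic 0x%08x)\n",
                    (unsigned)magic);
    else
        std::printf("unrecognized magic 0x%08x\n", (unsigned)magic);
    return 0;
}
```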

Highlighted Details

  • Supports multiple model families, including Alpaca, Llama, Llama-2, ChatGLM/ChatGLM2, and Baichuan.
  • Optimized for Arm, x86, CUDA, and RISC-V vector architectures.
  • Achieves acceptable inference speeds on mobile devices.
  • Performance optimizations include int4 matmul kernels with ARM assembly and kernel packing (see the quantization sketch after this list).
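
The int4 kernels rest on block-wise quantization: weights are grouped into fixed-size blocks, each stored as packed 4-bit integers plus one float scale, and the matmul inner loop unpacks and accumulates them. Below is a simplified illustration in the spirit of ggml's Q4_0 layout (32 weights per block, two values per byte); exact on-disk layouts differ, and this is a scalar sketch of the idea, not InferLLM's optimized ARM-assembly code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>

constexpr int kBlock = 32;

struct BlockQ4 {
    float scale;                      // per-block dequantization scale
    std::uint8_t packed[kBlock / 2];  // two 4-bit values per byte
};

// Quantize one block of 32 floats to 4-bit integers in [-8, 7].
BlockQ4 quantize_block(const float* x) {
    float amax = 0.f;
    for (int i = 0; i < kBlock; ++i) amax = std::max(amax, std::fabs(x[i]));
    BlockQ4 b{};
    b.scale = amax / 7.f;
    float inv = b.scale ? 1.f / b.scale : 0.f;
    for (int i = 0; i < kBlock; i += 2) {
        int lo = std::clamp((int)std::lround(x[i] * inv), -8, 7);
        int hi = std::clamp((int)std::lround(x[i + 1] * inv), -8, 7);
        // Store with a +8 offset so each value fits an unsigned nibble.
        b.packed[i / 2] = (std::uint8_t)((lo + 8) | ((hi + 8) << 4));
    }
    return b;
}

// Dot product of one quantized block against 32 float activations; an
// optimized kernel would vectorize this with NEON/assembly over many blocks.
float dot_block(const BlockQ4& b, const float* y) {
    float acc = 0.f;
    for (int i = 0; i < kBlock; i += 2) {
        int lo = (b.packed[i / 2] & 0x0F) - 8;
        int hi = (b.packed[i / 2] >> 4) - 8;
        acc += lo * y[i] + hi * y[i + 1];
    }
    return acc * b.scale;
}

int main() {
    float w[kBlock], a[kBlock], ref = 0.f;
    for (int i = 0; i < kBlock; ++i) { w[i] = 0.1f * (i - 16); a[i] = 1.f; }
    for (int i = 0; i < kBlock; ++i) ref += w[i] * a[i];
    BlockQ4 q = quantize_block(w);
    std::printf("int4 dot: %.3f (fp32 reference: %.3f)\n", dot_block(q, a), ref);
}
```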

Maintenance & Community

Recent updates added Llama-2 support and performance optimizations; however, the last commit was about a year ago (see Health Check below). Specific community channels and contributor details are not prominently featured in the README.

Licensing & Compatibility

Licensed under the Apache License, Version 2.0. This permissive license allows for commercial use and integration into closed-source projects.

Limitations & Caveats

Currently, only CUDA is supported for GPU acceleration. The project primarily focuses on int4 quantized models, and while it supports various architectures, performance tuning might be required for specific hardware.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
Star History
6 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Nat Friedman (former CEO of GitHub), and 32 more.

llama.cpp by ggml-org

Top 0.4% · 84k stars
C/C++ library for local LLM inference
created 2 years ago · updated 15 hours ago