InferLLM by MegEngine

Lightweight LLM inference framework

Created 2 years ago
739 stars

Top 46.9% on SourcePulse

Project Summary

InferLLM is a lightweight LLM inference framework designed for efficient local deployment of quantized models. It targets developers and researchers seeking a simpler, more modular alternative to projects like llama.cpp, offering improved readability and maintainability while retaining high performance across various architectures.

How It Works

InferLLM adopts a decoupled architecture, separating the framework logic from low-level kernels. This design choice enhances code clarity and facilitates easier modification compared to monolithic projects. It ports many of llama.cpp's optimized kernels and introduces a dedicated KV storage type for efficient caching and management, aiming for high inference speeds on both CPU and GPU.
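
The split can be pictured with a short sketch. The class and function names below are hypothetical, chosen for illustration only, and are not InferLLM's actual API: the framework layer talks to backend kernels through a small interface and registry, and a dedicated KV cache object owns the per-token key/value buffers.

    // Illustrative only: hypothetical names, not InferLLM's real classes.
    #include <cstddef>
    #include <map>
    #include <memory>
    #include <string>
    #include <vector>

    // Framework-facing kernel interface; each backend supplies its own implementation.
    struct MatmulKernel {
        virtual ~MatmulKernel() = default;
        virtual void run(const float* a, const float* b, float* out,
                         std::size_t m, std::size_t n, std::size_t k) = 0;
    };

    // Naive CPU backend; a CUDA backend would register a different subclass.
    struct CpuMatmul : MatmulKernel {
        void run(const float* a, const float* b, float* out,
                 std::size_t m, std::size_t n, std::size_t k) override {
            for (std::size_t i = 0; i < m; ++i)
                for (std::size_t j = 0; j < n; ++j) {
                    float acc = 0.f;
                    for (std::size_t p = 0; p < k; ++p)
                        acc += a[i * k + p] * b[p * n + j];
                    out[i * n + j] = acc;
                }
        }
    };

    // Framework logic only sees the registry, never the backend details.
    std::map<std::string, std::unique_ptr<MatmulKernel>>& kernel_registry() {
        static std::map<std::string, std::unique_ptr<MatmulKernel>> reg;
        return reg;
    }

    // Dedicated KV storage: per-layer key/value buffers, appended once per decoded token.
    struct KvCache {
        std::size_t head_dim;
        std::vector<std::vector<float>> keys, values;  // [layer][token * head_dim]
        KvCache(std::size_t layers, std::size_t dim)
                : head_dim(dim), keys(layers), values(layers) {}
        void append(std::size_t layer, const float* k, const float* v) {
            keys[layer].insert(keys[layer].end(), k, k + head_dim);
            values[layer].insert(values[layer].end(), v, v + head_dim);
        }
        std::size_t tokens(std::size_t layer) const { return keys[layer].size() / head_dim; }
    };

    int main() {
        kernel_registry()["cpu"] = std::make_unique<CpuMatmul>();

        // A 1x2 by 2x1 matmul dispatched through the registry, as the framework layer would do.
        float a[2] = {1.f, 2.f}, b[2] = {3.f, 4.f}, out[1] = {0.f};
        kernel_registry().at("cpu")->run(a, b, out, 1, 1, 2);  // out[0] == 11

        KvCache cache(/*layers=*/2, /*head_dim=*/4);
        float key[4] = {0, 1, 2, 3}, val[4] = {3, 2, 1, 0};
        cache.append(0, key, val);  // one decoded token now cached for layer 0
        return (out[0] == 11.f && cache.tokens(0) == 1) ? 0 : 1;
    }

In a layout like this, a CUDA backend plugs in by registering another MatmulKernel subclass while the framework-side code stays untouched, which is the property that makes a decoupled design easier to read and modify than a monolithic one.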

Quick Start & Requirements

  • Install: Compile from source using CMake. Enable GPU support with cmake -DENABLE_GPU=ON ... (an example build sequence is sketched after this list).
  • Prerequisites: CUDA Toolkit (if enabling GPU), NDK (for Android cross-compilation).
  • Models: Compatible with llama.cpp models, downloadable from Hugging Face.
  • Docs: ChatGLM model documentation
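
A typical from-source build could look like the sketch below. Only the -DENABLE_GPU=ON flag is taken from the notes above; the repository URL and the remaining steps are assumed, standard CMake conventions rather than commands confirmed by the project's docs.

    # Clone the sources (repository URL assumed from the project name).
    git clone https://github.com/MegEngine/InferLLM.git
    cd InferLLM

    # Standard out-of-source CMake build; -DENABLE_GPU=ON enables the CUDA path.
    mkdir build && cd build
    cmake .. -DENABLE_GPU=ON
    make -j"$(nproc)"

For Android cross-compilation, the NDK toolchain would typically be passed to CMake via -DCMAKE_TOOLCHAIN_FILE pointing at the NDK's android.toolchain.cmake.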

Highlighted Details

  • Supports multiple model formats including Alpaca, Llama, Llama-2, ChatGLM/ChatGLM2, and Baichuan.
  • Optimized for Arm, x86, CUDA, and RISC-V vector architectures.
  • Achieves acceptable inference speeds on mobile devices.
  • Performance optimizations include int4 matmul kernels with Arm assembly and kernel packing (a generic int4 block-quantization sketch follows this list).
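
As background for the int4 claim, the sketch below shows one common block-quantization scheme: blocks of 32 weights share a single scale and each weight is stored as a 4-bit code. It illustrates the general technique only and is not InferLLM's exact storage format or kernel.

    // Generic block-wise int4 quantization sketch (not InferLLM's on-disk format).
    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    struct Int4Block {
        float scale;         // per-block scale factor
        uint8_t packed[16];  // 32 signed 4-bit codes, two per byte
    };

    Int4Block quantize_block(const float* x) {
        Int4Block blk{};
        float max_abs = 0.f;
        for (int i = 0; i < 32; ++i) max_abs = std::max(max_abs, std::fabs(x[i]));
        blk.scale = max_abs / 7.f;  // map [-max_abs, max_abs] onto roughly [-7, 7]
        float inv = blk.scale > 0.f ? 1.f / blk.scale : 0.f;
        for (int i = 0; i < 32; i += 2) {
            int lo = static_cast<int>(std::lrintf(std::clamp(x[i] * inv, -8.f, 7.f))) + 8;
            int hi = static_cast<int>(std::lrintf(std::clamp(x[i + 1] * inv, -8.f, 7.f))) + 8;
            blk.packed[i / 2] = static_cast<uint8_t>(lo | (hi << 4));  // bias to [0,15], pack
        }
        return blk;
    }

    float dequantize(const Int4Block& blk, int i) {
        int nib = (blk.packed[i / 2] >> ((i & 1) * 4)) & 0xF;  // unpack the 4-bit code
        return (nib - 8) * blk.scale;                          // undo bias and scale
    }

    int main() {
        std::vector<float> x(32, 0.5f);
        Int4Block b = quantize_block(x.data());
        return std::fabs(dequantize(b, 0) - 0.5f) < 0.05f ? 0 : 1;
    }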

Maintenance & Community

The most recent updates added Llama-2 support and further performance optimizations, but activity has since slowed; the health metrics below show the last commit was about a year ago. Specific community channels and contributor details are not prominently featured in the README.

Licensing & Compatibility

Licensed under the Apache License, Version 2.0. This permissive license allows for commercial use and integration into closed-source projects.

Limitations & Caveats

Currently, only CUDA is supported for GPU acceleration. The project primarily focuses on int4 quantized models, and while it supports various architectures, performance tuning might be required for specific hardware.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 3 stars in the last 30 days

Explore Similar Projects

Starred by Ross Wightman (Author of timm; CV at Hugging Face), Awni Hannun (Author of MLX; Research Scientist at Apple), and 1 more.

mlx-llm by riccardomusmeci
0% · 454 stars
LLM tools/apps for Apple Silicon using MLX
Created 1 year ago · Updated 7 months ago
Starred by Didier Lopes (Founder of OpenBB), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 5 more.

mlx-lm by ml-explore
26.1% · 2k stars
Python package for LLM text generation and fine-tuning on Apple silicon
Created 6 months ago · Updated 22 hours ago
Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin
0.1% · 6k stars
Inference optimization for LLMs on low-resource hardware
Created 2 years ago · Updated 2 weeks ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; Author of CS 231n), Nat Friedman (Former CEO of GitHub), and 54 more.

llama.cpp by ggml-org
0.4% · 87k stars
C/C++ library for local LLM inference
Created 2 years ago · Updated 13 hours ago