KuiperLLama by zjhellofss

LLM inference framework for hands-on learning (Llama2/3, Qwen2.5)

created 1 year ago
395 stars

Top 74.1% on sourcepulse

Project Summary

This project provides a C++-based large language model inference framework, KuiperLLama, designed for educational purposes and practical application in LLM development. It targets students and developers interested in understanding and implementing LLM inference from scratch, offering a hands-on approach to building a performant inference engine.

How It Works

The framework is written in modern C++20, emphasizing clean code, robust error handling, and project management via CMake and Git. It takes a dual-backend approach, supporting both CPU and CUDA-accelerated inference. The CUDA backend uses custom-written CUDA kernels for optimized performance, and the framework supports INT8 quantization for a reduced memory footprint and faster inference.

Quick Start & Requirements

  • Install/Run: Compile from source using CMake.
  • Prerequisites: C++20 compiler, CMake, CUDA Toolkit, Google glog, Google gtest, SentencePiece, Armadillo + OpenBLAS. The USE_CPM=ON CMake option can automate dependency downloads.
  • Models: Llama 2/3, Qwen 2.5. Model weights can be downloaded from Hugging Face or provided Baidu links.
  • Resources: Requires a CUDA-enabled GPU for accelerated inference.
  • Docs: Course details and structure are available at https://l0kzvikuq0w.feishu.cn/docx/ZF2hd0xfAoaXqaxcpn2c5oHAnBc.

Highlighted Details

  • Supports Llama 2/3 (including Llama 3.2) and Qwen 2.5 models.
  • Features custom CUDA kernels for performance optimization.
  • Includes INT8 quantization support.
  • Provides unit testing and benchmarking guidance.

Maintenance & Community

The project is associated with the KuiperInfer course, which has achieved 2.5k stars on GitHub. Further community engagement details are not explicitly provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is presented as an educational course, so the focus is on learning rather than production readiness. Performance benchmarks are reported for only a single hardware configuration (an Nvidia 3060 laptop GPU). The availability and stability of the custom CUDA kernels across all supported models and hardware configurations may require further validation.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 65 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (founder of Ostris), and 1 more.

nunchaku by nunchaku-tech

High-performance 4-bit diffusion model inference engine
3k stars · 2.1%
created 8 months ago · updated 14 hours ago
Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems) and Ying Sheng (author of SGLang).

fastllm by ztxz16

High-performance C++ LLM inference library
4k stars · 0.4%
created 2 years ago · updated 2 weeks ago