kuiperdatawhale by zjhellofss

Course for building a deep learning inference framework

Created 2 years ago
296 stars

Top 89.5% on SourcePulse

View on GitHub
Project Summary

This project offers an educational framework for building a custom Large Language Model (LLM) inference engine from scratch, targeting engineers and researchers interested in deep learning internals. It provides hands-on experience with CUDA programming, model quantization, and efficient inference techniques, enabling users to understand and replicate the performance of commercial LLM inference solutions.

How It Works

The framework guides users through implementing core LLM components, including tensor and operator classes and an operator registration system. It covers loading LLM weights via memory mapping, implementing a KV cache, and writing custom CUDA kernels for operations such as RMSNorm, Softmax, and matrix multiplication. The project emphasizes int8 quantization for a reduced memory footprint and accelerated inference.
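
To make the kernel work concrete, here is a minimal RMSNorm sketch in the spirit of what the course builds; the kernel name and launch configuration are illustrative, not the project's actual code. One thread block normalizes one row, reducing the sum of squares in shared memory:

```cuda
#include <cuda_runtime.h>

// Illustrative RMSNorm kernel: out[i] = weight[i] * in[i] / sqrt(mean(in^2) + eps).
// One thread block handles one row of `dim` elements; blockDim.x must be a
// power of two for the tree reduction below.
__global__ void rmsnorm_kernel(const float* __restrict__ in,
                               const float* __restrict__ weight,
                               float* __restrict__ out,
                               int dim, float eps) {
  extern __shared__ float partial[];  // one slot per thread
  const float* row_in  = in  + blockIdx.x * dim;
  float*       row_out = out + blockIdx.x * dim;

  // Each thread accumulates a strided partial sum of squares.
  float sum = 0.0f;
  for (int i = threadIdx.x; i < dim; i += blockDim.x) {
    float v = row_in[i];
    sum += v * v;
  }
  partial[threadIdx.x] = sum;
  __syncthreads();

  // Tree reduction in shared memory.
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
    __syncthreads();
  }

  // All threads read the reduced sum and apply the normalization.
  float inv_rms = rsqrtf(partial[0] / dim + eps);
  for (int i = threadIdx.x; i < dim; i += blockDim.x) {
    row_out[i] = weight[i] * row_in[i] * inv_rms;
  }
}

// Hypothetical launch for `rows` rows of width `dim`, 256 threads per row:
// rmsnorm_kernel<<<rows, 256, 256 * sizeof(float)>>>(d_in, d_w, d_out, dim, 1e-5f);
```

A tuned kernel would typically use warp shuffles for the reduction, but the shared-memory version keeps the structure easy to follow.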

Quick Start & Requirements

  • Install/Run: Compile from source using CMake.
  • Prerequisites: CUDA Toolkit, C++ compiler, Google glog, Google gtest, SentencePiece, Armadillo with OpenBLAS (or Intel MKL).
  • Models: Llama 2, Llama 3.2, TinyLlama, Qwen2.5. Download links and export scripts are provided.
  • Setup: Requires compiling the C++ code and downloading model weights (a weight-loading sketch follows this list).
  • Docs: Course outline available at https://l0kzvikuq0w.feishu.cn/docx/ZF2hd0xfAoaXqaxcpn2c5oHAnBc.
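
For the weight-loading step, the course maps the checkpoint into memory rather than reading it into buffers. A minimal sketch of that pattern is below; the function name is hypothetical, and the real file layout and header parsing depend on the project's export scripts (here we assume a raw float32 blob):

```cuda
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>
#include <cstdio>

// Hypothetical helper: map a raw float32 weight file into memory read-only.
// Pages are faulted in lazily, so even multi-gigabyte checkpoints "open"
// almost instantly and untouched tensors never consume physical memory.
const float* map_weights(const char* path, size_t* out_count) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) { perror("open"); return nullptr; }

  struct stat st;
  if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return nullptr; }

  void* data = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  close(fd);  // the mapping remains valid after the fd is closed
  if (data == MAP_FAILED) { perror("mmap"); return nullptr; }

  *out_count = st.st_size / sizeof(float);
  return static_cast<const float*>(data);
}
```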

Highlighted Details

  • Full implementation of LLM inference from scratch, including custom CUDA operators.
  • Support for Llama 2, Llama 3.2, and Qwen2.5 models.
  • Int8 quantization for memory efficiency and performance gains (see the sketch after this list).
  • Detailed explanations of Transformer architecture and CUDA programming for LLMs.
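
As a rough illustration of the quantization point above, here is a sketch of symmetric per-tensor int8 quantization; the scheme is an assumption (the course's actual format may differ, e.g. per-group scales), but it shows where the 4x memory saving over float32 comes from:

```cuda
#include <math.h>
#include <stdint.h>

// Symmetric per-tensor quantization: scale so the largest magnitude maps to
// 127, then store each value as a single signed byte plus one shared scale.
float quantize_int8(const float* x, int8_t* q, int n) {
  float max_abs = 0.0f;
  for (int i = 0; i < n; ++i) max_abs = fmaxf(max_abs, fabsf(x[i]));
  float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;

  for (int i = 0; i < n; ++i) {
    q[i] = (int8_t)lroundf(x[i] / scale);
  }
  return scale;  // kept alongside the tensor for dequantization
}

// Dequantize on the fly inside a matmul kernel: x ~= scale * q.
inline float dequantize_int8(int8_t q, float scale) { return scale * q; }
```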

Maintenance & Community

The project is presented as an open-source course with significant community interest. Contact information (WeChat ID: lyrry1997) is provided for course inquiries.

Licensing & Compatibility

The README does not explicitly state a license. Dependencies include libraries with various licenses (e.g., Google libraries, SentencePiece, Armadillo). Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is presented as an educational course, implying a focus on learning rather than production-readiness. Specific CUDA kernel implementations are marked as "content to be determined." The lack of an explicit license may pose adoption challenges.

Health Check

  • Last Commit: 11 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 30 days

Explore Similar Projects

Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 4 more.

AQLM by Vahe1994

  • 1k stars
  • PyTorch code for LLM compression via Additive Quantization (AQLM)
  • Created 1 year ago; updated 1 month ago
Starred by Junyang Lin (Core Maintainer at Alibaba Qwen), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 3 more.

neural-compressor by intel

  • 2k stars
  • Python library for model compression (quantization, pruning, distillation, NAS)
  • Created 5 years ago; updated 16 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Jeff Hammerbacher (Cofounder of Cloudera), and 4 more.

gemma_pytorch by google

  • 6k stars
  • PyTorch implementation for Google's Gemma models
  • Created 1 year ago; updated 3 months ago