kuiperdatawhale by zjhellofss

Course for building a deep learning inference framework

Created 2 years ago
296 stars

Top 89.5% on SourcePulse

View on GitHub
Project Summary

This project offers an educational framework for building a custom Large Language Model (LLM) inference engine from scratch, targeting engineers and researchers interested in deep learning internals. It provides hands-on experience with CUDA programming, model quantization, and efficient inference techniques, enabling users to understand and replicate the performance of commercial LLM inference solutions.

How It Works

The framework guides users through implementing core LLM components, including tensor and operator classes and an operator registration system. It covers loading LLM weights via memory mapping, implementing a KV cache, and writing custom CUDA kernels for operations such as RMSNorm, Softmax, and matrix multiplication. The project emphasizes int8 quantization for a reduced memory footprint and accelerated inference.
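
To make the kernel work concrete, here is a minimal RMSNorm sketch in the spirit of what the course builds; the kernel name and launch configuration are illustrative, not the project's actual code. One thread block normalizes one row, reducing the sum of squares in shared memory:

```cuda
#include <cuda_runtime.h>

// Illustrative RMSNorm kernel: out[i] = weight[i] * in[i] / sqrt(mean(in^2) + eps).
// One thread block handles one row of `dim` elements; blockDim.x must be a
// power of two for the tree reduction below.
__global__ void rmsnorm_kernel(const float* __restrict__ in,
                               const float* __restrict__ weight,
                               float* __restrict__ out,
                               int dim, float eps) {
  extern __shared__ float partial[];  // one slot per thread
  const float* row_in  = in  + blockIdx.x * dim;
  float*       row_out = out + blockIdx.x * dim;

  // Each thread accumulates a strided partial sum of squares.
  float sum = 0.0f;
  for (int i = threadIdx.x; i < dim; i += blockDim.x) {
    float v = row_in[i];
    sum += v * v;
  }
  partial[threadIdx.x] = sum;
  __syncthreads();

  // Tree reduction in shared memory.
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
    __syncthreads();
  }

  // All threads read the reduced sum and apply the normalization.
  float inv_rms = rsqrtf(partial[0] / dim + eps);
  for (int i = threadIdx.x; i < dim; i += blockDim.x) {
    row_out[i] = weight[i] * row_in[i] * inv_rms;
  }
}

// Hypothetical launch for `rows` rows of width `dim`, 256 threads per row:
// rmsnorm_kernel<<<rows, 256, 256 * sizeof(float)>>>(d_in, d_w, d_out, dim, 1e-5f);
```

A tuned kernel would typically use warp shuffles for the reduction, but the shared-memory version keeps the structure easy to follow.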

Quick Start & Requirements

  • Install/Run: Compile from source using CMake.
  • Prerequisites: CUDA Toolkit, C++ compiler, Google glog, Google gtest, SentencePiece, Armadillo with OpenBLAS (or Intel MKL).
  • Models: Llama 2, Llama 3.2, TinyLlama, Qwen2.5. Download links and export scripts are provided.
  • Setup: Requires compiling the C++ code and downloading model weights (a weight-loading sketch follows this list).
  • Docs: Course outline available at https://l0kzvikuq0w.feishu.cn/docx/ZF2hd0xfAoaXqaxcpn2c5oHAnBc.
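
For the weight-loading step, the course maps the checkpoint into memory rather than reading it into buffers. A minimal sketch of that pattern is below; the function name is hypothetical, and the real file layout and header parsing depend on the project's export scripts (here we assume a raw float32 blob):

```cuda
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>
#include <cstdio>

// Hypothetical helper: map a raw float32 weight file into memory read-only.
// Pages are faulted in lazily, so even multi-gigabyte checkpoints "open"
// almost instantly and untouched tensors never consume physical memory.
const float* map_weights(const char* path, size_t* out_count) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) { perror("open"); return nullptr; }

  struct stat st;
  if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return nullptr; }

  void* data = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  close(fd);  // the mapping remains valid after the fd is closed
  if (data == MAP_FAILED) { perror("mmap"); return nullptr; }

  *out_count = st.st_size / sizeof(float);
  return static_cast<const float*>(data);
}
```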

Highlighted Details

  • Full implementation of LLM inference from scratch, including custom CUDA operators.
  • Support for Llama 2, Llama 3.2, and Qwen2.5 models.
  • Int8 quantization for memory efficiency and performance gains (see the sketch after this list).
  • Detailed explanations of Transformer architecture and CUDA programming for LLMs.
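
As a rough illustration of the quantization point above, here is a sketch of symmetric per-tensor int8 quantization; the scheme is an assumption (the course's actual format may differ, e.g. per-group scales), but it shows where the 4x memory saving over float32 comes from:

```cuda
#include <math.h>
#include <stdint.h>

// Symmetric per-tensor quantization: scale so the largest magnitude maps to
// 127, then store each value as a single signed byte plus one shared scale.
float quantize_int8(const float* x, int8_t* q, int n) {
  float max_abs = 0.0f;
  for (int i = 0; i < n; ++i) max_abs = fmaxf(max_abs, fabsf(x[i]));
  float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;

  for (int i = 0; i < n; ++i) {
    q[i] = (int8_t)lroundf(x[i] / scale);
  }
  return scale;  // kept alongside the tensor for dequantization
}

// Dequantize on the fly inside a matmul kernel: x ~= scale * q.
inline float dequantize_int8(int8_t q, float scale) { return scale * q; }
```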

Maintenance & Community

The project is presented as an open-source course with significant community interest. Contact information (WeChat ID: lyrry1997) is provided for course inquiries.

Licensing & Compatibility

The README does not explicitly state a license. Dependencies include libraries with various licenses (e.g., Google libraries, SentencePiece, Armadillo). Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is presented as an educational course, implying a focus on learning rather than production-readiness. Specific CUDA kernel implementations are marked as "content to be determined." The lack of an explicit license may pose adoption challenges.

Health Check

  • Last Commit: 11 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 30 days

Explore Similar Projects

Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 4 more.

AQLM by Vahe1994

  • 1k stars
  • PyTorch code for LLM compression via Additive Quantization (AQLM)
  • Created 1 year ago; updated 1 month ago
Starred by Junyang Lin (Core Maintainer at Alibaba Qwen), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 3 more.

neural-compressor by intel

  • 2k stars
  • Python library for model compression (quantization, pruning, distillation, NAS)
  • Created 5 years ago; updated 16 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Jeff Hammerbacher (Cofounder of Cloudera), and 4 more.

gemma_pytorch by google

  • 6k stars
  • PyTorch implementation for Google's Gemma models
  • Created 1 year ago; updated 3 months ago