rwkv.cpp by RWKV

CPU inference lib for RWKV language model

created 2 years ago
1,536 stars

Top 27.6% on sourcepulse

Project Summary

This project provides a C library and a Python wrapper for efficient inference of RWKV language models on CPUs, supporting INT4, INT5, and INT8 quantization as well as FP16. It targets developers and researchers who need to run large language models with reduced memory and compute requirements, especially for long contexts, where RWKV's linear attention is advantageous.

How It Works

RWKV model files are converted to the ggml format, and inference runs on ggml's CPU-optimized kernels. The architecture's recurrent state-space design processes sequences with constant memory and computation per token, unlike Transformer models, whose attention cost grows with context length. This makes it particularly suitable for CPU-bound workloads and long context windows. The project supports multiple RWKV versions (v4, v5, v6, v7) and LoRA checkpoint merging.
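
A minimal sketch of what token-by-token inference looks like through the bundled Python wrapper. Module and class names (rwkv_cpp_shared_library, rwkv_cpp_model.RWKVModel) follow the repository's python/ directory, but import paths and signatures have varied between versions, so treat this as illustrative rather than definitive:

    # Feed tokens one at a time; the recurrent state is a fixed-size
    # tensor threaded through each call, and it never grows.
    from rwkv_cpp import rwkv_cpp_shared_library, rwkv_cpp_model

    library = rwkv_cpp_shared_library.load_rwkv_shared_library()
    model = rwkv_cpp_model.RWKVModel(library, 'rwkv-model-Q5_1.bin')  # placeholder path

    state = None  # initialized by the first eval() call
    for token in [510, 3158, 2225]:  # token IDs from your tokenizer
        logits, state = model.eval(token, state)

    # logits and state keep the same size regardless of how many tokens
    # have been processed -- the constant-cost property described above.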

Quick Start & Requirements

  • Install: Clone the repository (git clone --recursive), then build the library using CMake (cmake . && cmake --build . --config Release). Pre-compiled binaries are available on the Releases page.
  • Prerequisites: CMake, a C++ compiler (Build Tools for Visual Studio on Windows), and Python 3.x with PyTorch and NumPy for model conversion and inference. Optional: CUDA (cuBLAS) or ROCm (hipBLAS) for GPU acceleration.
  • Model Conversion: Download a PyTorch model from Hugging Face and convert it with the provided Python scripts (convert_pytorch_to_ggml.py, quantize.py); an example flow follows this list.
  • Running: Execute inference via the Python scripts generate_completions.py and chat_with_bot.py.
  • Docs: https://github.com/saharNooby/rwkv.cpp
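
As a rough example of the conversion flow (file names here are placeholders, and Q5_1 is one of the supported quantized formats):

    python python/convert_pytorch_to_ggml.py RWKV-model.pth rwkv-model-FP16.bin FP16
    python python/quantize.py rwkv-model-FP16.bin rwkv-model-Q5_1.bin Q5_1
    python python/generate_completions.py rwkv-model-Q5_1.bin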

Highlighted Details

  • Supports INT4, INT5, INT8, FP16, and FP32 inference.
  • Offers significant performance gains on CPU via quantization and AVX2.
  • GPU acceleration via cuBLAS and hipBLAS is supported, with the option to offload a number of model layers to the GPU.
  • Benchmarks show reduced latency and smaller file sizes with quantization.
  • A Python wrapper and a C API are available for integration; see the generation sketch after this list.
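
Building on the wrapper sketch above, an end-to-end completion loop might look like the following. The tokenizer file name and greedy decoding are assumptions for illustration; the repository ships its own tokenizer files and sampling utilities:

    # Greedy prompt completion; swap in the repo's sampling helpers for
    # temperature/top-p sampling in real use.
    from tokenizers import Tokenizer
    from rwkv_cpp import rwkv_cpp_shared_library, rwkv_cpp_model

    tokenizer = Tokenizer.from_file('20B_tokenizer.json')  # assumed tokenizer file
    model = rwkv_cpp_model.RWKVModel(
        rwkv_cpp_shared_library.load_rwkv_shared_library(),
        'rwkv-model-Q5_1.bin')  # placeholder path

    state = None
    for token in tokenizer.encode('In a shocking finding,').ids:
        logits, state = model.eval(token, state)  # ingest the prompt

    generated = []
    for _ in range(32):
        token = int(logits.argmax())              # greedy pick
        generated.append(token)
        logits, state = model.eval(token, state)

    print(tokenizer.decode(generated))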

Maintenance & Community

  • Actively maintained, with support for the latest RWKV architectures.
  • Bindings for Golang and Node.js are available.

Licensing & Compatibility

  • The project is licensed under the MIT License, allowing for commercial use and integration with closed-source applications.

Limitations & Caveats

  • GPU acceleration (cuBLAS/hipBLAS) offloads only the ggml_mul_mat() operation; everything else still runs on the CPU, so GPU offloading does not eliminate CPU load.
  • ggml library updates can occasionally break compatibility with older model file formats; users should refer to docs/FILE_FORMAT.md for version tracking.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 29 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jared Palmer (Ex-VP of AI at Vercel; Founder of Turborepo; Author of Formik, TSDX), and 1 more.

mpt-30B-inference by abacaj

0% · 575 stars
CPU inference code for MPT-30B
created 2 years ago · updated 2 years ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Tim J. Baek (Founder of Open WebUI), and 5 more.

gemma.cpp by google

0.1% · 7k stars
C++ inference engine for Google's Gemma models
created 1 year ago · updated 1 day ago
Starred by Bojan Tunguz (AI Scientist; Formerly at NVIDIA), Mckay Wrigley (Founder of Takeoff AI), and 8 more.

ggml by ggml-org

0.3% · 13k stars
Tensor library for machine learning
created 2 years ago · updated 3 days ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Nat Friedman (Former CEO of GitHub), and 32 more.

llama.cpp by ggml-org

0.4% · 84k stars
C/C++ library for local LLM inference
created 2 years ago · updated 14 hours ago