CPU inference library for the RWKV language model
This project provides a C library and Python wrapper for efficient inference of RWKV language models on CPUs, supporting various quantization formats (INT4, INT5, INT8) and FP16. It targets developers and researchers needing to run large language models with reduced memory and computational requirements, especially for long contexts where RWKV's linear attention is advantageous.
How It Works
RWKV models are ported to the ggml library, enabling CPU-optimized inference. The architecture's state-space design lets it process sequences with constant memory and computation per token, unlike Transformer models, whose attention cost grows quadratically with context length. This makes it particularly suitable for CPU-bound workloads and long context windows. The project supports multiple RWKV versions (v4, v5, v6, v7) and merging LoRA checkpoints into a base model.
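As a rough sketch of why per-token cost stays constant, the snippet below implements a simplified RWKV v4-style WKV recurrence in NumPy. The variable names and dimensions are illustrative, and the exponent-rescaling trick that real kernels use for numerical stability is omitted.

```python
# Simplified RWKV (v4-style) WKV recurrence. The recurrent state (a, b) has a
# fixed size, so each new token costs O(d) time and memory rather than growing
# with sequence length.
import numpy as np

d = 8                      # channel dimension (illustrative)
w = np.full(d, 0.5)        # per-channel decay rate (> 0)
u = np.zeros(d)            # per-channel bonus applied to the current token

def wkv_step(k_t, v_t, a, b):
    """Process one token; (a, b) is the entire recurrent state."""
    y_t = (a + np.exp(u + k_t) * v_t) / (b + np.exp(u + k_t))
    a = np.exp(-w) * a + np.exp(k_t) * v_t   # decay history, fold in current token
    b = np.exp(-w) * b + np.exp(k_t)
    return y_t, a, b

a, b = np.zeros(d), np.zeros(d)
keys = np.random.randn(16, d)
values = np.random.randn(16, d)
for k_t, v_t in zip(keys, values):
    y_t, a, b = wkv_step(k_t, v_t, a, b)     # per-token cost is independent of position
```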
Quick Start & Requirements
Clone the repository with its submodules (git clone --recursive), then build the library with CMake (cmake . && cmake --build . --config Release). Pre-compiled binaries are available on the Releases page. Convert a PyTorch checkpoint to the ggml format and optionally quantize it with the bundled Python scripts (convert_pytorch_to_ggml.py, quantize.py), then run inference through the example scripts (generate_completions.py, chat_with_bot.py); a minimal sketch of this flow follows below.
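The sketch below assumes the Python wrapper's rwkv_cpp_model and rwkv_cpp_shared_library modules and the eval() call used by the repository's example scripts; module, class, and argument names may differ between versions, so treat them as assumptions and check the python/ directory for the authoritative API.

```python
# Hypothetical conversion and quantization steps (paths and argument order are
# illustrative):
#   python convert_pytorch_to_ggml.py RWKV-4-Pile-169M.pth rwkv-169m-f16.bin FP16
#   python quantize.py rwkv-169m-f16.bin rwkv-169m-q5_1.bin Q5_1
#
# Minimal greedy-decoding sketch with the Python wrapper (names assumed).
from rwkv_cpp import rwkv_cpp_shared_library, rwkv_cpp_model

library = rwkv_cpp_shared_library.load_rwkv_shared_library()
model = rwkv_cpp_model.RWKVModel(library, 'rwkv-169m-q5_1.bin', thread_count=4)

state = None
for token in [510, 4687]:          # prompt token ids from your tokenizer of choice
    logits, state = model.eval(token, state)

# The fixed-size state carries the whole context, so memory stays flat
# no matter how long the prompt is.
next_token = int(logits.argmax())
print(next_token)

model.free()
```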
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
GPU offloading accelerates only matrix multiplication (ggml_mul_mat()), requiring some CPU resources for other operations. ggml library updates can occasionally break compatibility with older model file formats; users should refer to docs/FILE_FORMAT.md for version tracking.