rwkv.cpp by RWKV

CPU inference lib for RWKV language model

created 2 years ago
1,536 stars

Top 27.6% on sourcepulse

Project Summary

This project provides a C library and a Python wrapper for efficient inference of RWKV language models on CPUs, supporting INT4, INT5, and INT8 quantization as well as FP16. It targets developers and researchers who need to run large language models with reduced memory and compute requirements, especially for long contexts, where RWKV's linear attention is advantageous.

How It Works

RWKV model files are converted to the ggml format, and inference runs on ggml's CPU-optimized kernels. The architecture's recurrent state-space design processes sequences with constant memory and computation per token, unlike Transformer models, whose attention cost grows with context length. This makes it particularly suitable for CPU-bound workloads and long context windows. The project supports multiple RWKV versions (v4, v5, v6, v7) and LoRA checkpoint merging.
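
A minimal sketch of what token-by-token inference looks like through the bundled Python wrapper. Module and class names (rwkv_cpp_shared_library, rwkv_cpp_model.RWKVModel) follow the repository's python/ directory, but import paths and signatures have varied between versions, so treat this as illustrative rather than definitive:

    # Feed tokens one at a time; the recurrent state is a fixed-size
    # tensor threaded through each call, and it never grows.
    from rwkv_cpp import rwkv_cpp_shared_library, rwkv_cpp_model

    library = rwkv_cpp_shared_library.load_rwkv_shared_library()
    model = rwkv_cpp_model.RWKVModel(library, 'rwkv-model-Q5_1.bin')  # placeholder path

    state = None  # initialized by the first eval() call
    for token in [510, 3158, 2225]:  # token IDs from your tokenizer
        logits, state = model.eval(token, state)

    # logits and state keep the same size regardless of how many tokens
    # have been processed -- the constant-cost property described above.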

Quick Start & Requirements

  • Install: Clone the repository (git clone --recursive), then build the library using CMake (cmake . && cmake --build . --config Release). Pre-compiled binaries are available on the Releases page.
  • Prerequisites: CMake, a C++ compiler (Build Tools for Visual Studio on Windows), and Python 3.x with PyTorch and NumPy for model conversion and inference. Optional: CUDA (cuBLAS) or ROCm (hipBLAS) for GPU acceleration.
  • Model Conversion: Download a PyTorch model from Hugging Face and convert it with the provided Python scripts (convert_pytorch_to_ggml.py, quantize.py); an example flow follows this list.
  • Running: Execute inference via the Python scripts generate_completions.py and chat_with_bot.py.
  • Docs: https://github.com/saharNooby/rwkv.cpp
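
As a rough example of the conversion flow (file names here are placeholders, and Q5_1 is one of the supported quantized formats):

    python python/convert_pytorch_to_ggml.py RWKV-model.pth rwkv-model-FP16.bin FP16
    python python/quantize.py rwkv-model-FP16.bin rwkv-model-Q5_1.bin Q5_1
    python python/generate_completions.py rwkv-model-Q5_1.bin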

Highlighted Details

  • Supports INT4, INT5, INT8, FP16, and FP32 inference.
  • Offers significant performance gains on CPU via quantization and AVX2.
  • GPU acceleration via cuBLAS and hipBLAS is supported, with the option to offload a number of model layers to the GPU.
  • Benchmarks show reduced latency and smaller file sizes with quantization.
  • A Python wrapper and a C API are available for integration; see the generation sketch after this list.
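
Building on the wrapper sketch above, an end-to-end completion loop might look like the following. The tokenizer file name and greedy decoding are assumptions for illustration; the repository ships its own tokenizer files and sampling utilities:

    # Greedy prompt completion; swap in the repo's sampling helpers for
    # temperature/top-p sampling in real use.
    from tokenizers import Tokenizer
    from rwkv_cpp import rwkv_cpp_shared_library, rwkv_cpp_model

    tokenizer = Tokenizer.from_file('20B_tokenizer.json')  # assumed tokenizer file
    model = rwkv_cpp_model.RWKVModel(
        rwkv_cpp_shared_library.load_rwkv_shared_library(),
        'rwkv-model-Q5_1.bin')  # placeholder path

    state = None
    for token in tokenizer.encode('In a shocking finding,').ids:
        logits, state = model.eval(token, state)  # ingest the prompt

    generated = []
    for _ in range(32):
        token = int(logits.argmax())              # greedy pick
        generated.append(token)
        logits, state = model.eval(token, state)

    print(tokenizer.decode(generated))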

Maintenance & Community

  • Actively maintained, with support for the latest RWKV architectures.
  • Bindings for Golang and Node.js are available.

Licensing & Compatibility

  • The project is licensed under the MIT License, allowing for commercial use and integration with closed-source applications.

Limitations & Caveats

  • GPU acceleration (cuBLAS/hipBLAS) offloads only the ggml_mul_mat() operation; everything else still runs on the CPU, so GPU offloading does not eliminate CPU load.
  • ggml library updates can occasionally break compatibility with older model file formats; users should refer to docs/FILE_FORMAT.md for version tracking.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 29 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jared Palmer (Ex-VP of AI at Vercel; Founder of Turborepo; Author of Formik, TSDX), and 1 more.

mpt-30B-inference by abacaj

0% · 575 stars
CPU inference code for MPT-30B
created 2 years ago · updated 2 years ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Tim J. Baek (Founder of Open WebUI), and 5 more.

gemma.cpp by google

0.1% · 7k stars
C++ inference engine for Google's Gemma models
created 1 year ago · updated 1 day ago
Starred by Bojan Tunguz (AI Scientist; Formerly at NVIDIA), Mckay Wrigley (Founder of Takeoff AI), and 8 more.

ggml by ggml-org

0.3% · 13k stars
Tensor library for machine learning
created 2 years ago · updated 3 days ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Nat Friedman (Former CEO of GitHub), and 32 more.

llama.cpp by ggml-org

0.4% · 84k stars
C/C++ library for local LLM inference
created 2 years ago · updated 14 hours ago