rwkv.cpp by RWKV

CPU inference lib for RWKV language model

Created 2 years ago
1,544 stars

Top 27.0% on SourcePulse

Project Summary

This project provides a C library and a Python wrapper for efficient inference of RWKV language models on CPU, supporting several quantized formats (INT4, INT5, INT8) as well as FP16 and FP32. It targets developers and researchers who need to run large language models with reduced memory and compute requirements, especially for long contexts, where RWKV's linear attention is advantageous.

How It Works

RWKV model weights are converted to the ggml format, and inference runs on ggml's CPU-optimized tensor kernels. The architecture's recurrent, attention-free design processes sequences with constant memory and compute per token, unlike Transformer models, whose attention cost grows quadratically with context length. This makes it particularly suitable for CPU-bound workloads and long context windows. The project supports multiple RWKV versions (v4, v5, v6, v7) and merging LoRA checkpoints into a base model.
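The constant-per-token cost can be illustrated with a toy linear-attention-style recurrence. This is a deliberate simplification, not RWKV's actual time-mixing formulas: each token folds its key/value into a fixed-size state, so memory use does not grow with sequence length.

```python
# Toy recurrent "linear attention" update, in the spirit of RWKV's design:
# a fixed-size state is updated once per token, so memory and compute per
# token are constant regardless of context length. Illustrative only; the
# real RWKV time-mixing equations differ.

def process_sequence(tokens, d=4, decay=0.9):
    # Fixed-size state: one d x d matrix plus a d-vector normalizer.
    state = [[0.0] * d for _ in range(d)]
    norm = [0.0] * d
    outputs = []
    for k, v in tokens:  # k, v: length-d key and value vectors per token
        for i in range(d):
            for j in range(d):
                # Exponentially decay old information, accumulate the new pair.
                state[i][j] = decay * state[i][j] + k[i] * v[j]
            norm[i] = decay * norm[i] + k[i]
        # Read the state back out, using the key as a stand-in query (toy choice).
        out = [
            sum(k[i] * state[i][j] for i in range(d)) /
            (sum(k[i] * norm[i] for i in range(d)) + 1e-8)
            for j in range(d)
        ]
        outputs.append(out)
    return outputs
```

A Transformer, by contrast, must keep keys and values for every previous token, so its per-token cost grows with context length; here the state stays `d * d + d` floats no matter how many tokens are processed.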

Quick Start & Requirements

  • Install: Clone the repository (git clone --recursive), then build the library using CMake (cmake . && cmake --build . --config Release). Pre-compiled binaries are available on the Releases page.
  • Prerequisites: CMake, C++ compiler (Build Tools for Visual Studio on Windows), Python 3.x with PyTorch and NumPy for model conversion and running. Optional: CUDA or hipBLAS for GPU acceleration.
  • Model Conversion: Requires downloading PyTorch models from Hugging Face and converting them using provided Python scripts (convert_pytorch_to_ggml.py, quantize.py).
  • Running: Execute inference via Python scripts (generate_completions.py, chat_with_bot.py).
  • Docs: https://github.com/saharNooby/rwkv.cpp

Highlighted Details

  • Supports INT4, INT5, INT8, FP16, and FP32 inference.
  • Offers significant performance gains on CPU via quantization and AVX2.
  • GPU acceleration via cuBLAS and hipBLAS is supported, offloading layers to GPU.
  • Benchmarks show reduced latency and file sizes with quantization.
  • Python wrapper and C API available for integration.
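The file-size and latency gains from quantization can be sketched with a toy symmetric INT8 scheme. This is illustrative only: ggml's real formats (e.g. Q4_0, Q5_1, Q8_0) quantize weights in small blocks with per-block scales, but the principle is the same.

```python
# Toy symmetric INT8 quantization: each FP32 weight (4 bytes) becomes one
# signed byte plus a shared scale, roughly a 4x size reduction, at the cost
# of a small, bounded rounding error.

def quantize_int8(weights):
    # One scale for the whole tensor (real ggml formats use per-block scales).
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    return [x * scale for x in q]

weights = [0.25, -1.0, 0.5, 0.031, -0.42]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
# Rounding error is bounded by half a quantization step.
assert max_err <= scale / 2 + 1e-9
```

Going from FP32 to INT8 shrinks the weight payload about 4x (and INT4 about 8x), which is where the smaller file sizes and reduced memory-bandwidth pressure in the benchmarks come from.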

Maintenance & Community

  • Supports the latest RWKV architectures (v4 through v7), though recent activity has slowed (last commit about 5 months ago).
  • Bindings for Golang and Node.js are available.

Licensing & Compatibility

  • The project is licensed under the MIT License, allowing for commercial use and integration with closed-source applications.

Limitations & Caveats

  • GPU acceleration (cuBLAS/hipBLAS) offloads only matrix multiplication (ggml_mul_mat()); all other operations still run on the CPU, so some CPU resources are always required.
  • ggml library updates can occasionally break compatibility with older model file formats; users should refer to docs/FILE_FORMAT.md for version tracking.
Health Check

  • Last commit: 5 months ago
  • Responsiveness: inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 5 stars in the last 30 days