ai00_server by Ai00-X

RWKV inference API server with OpenAI-compatible interface

created 2 years ago
573 stars

Top 57.1% on sourcepulse

Project Summary

AI00 RWKV Server provides a compact, all-in-one inference API for RWKV language models, targeting users who need an efficient and easy-to-deploy LLM solution. It offers OpenAI-compatible API endpoints, enabling chatbots, text generation, and Q&A applications without requiring heavy dependencies like PyTorch or CUDA.
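Because the endpoints follow the OpenAI convention, existing client code can simply point at the local server. A minimal sketch in Python, assuming the default port from the Quick Start; the route in `BASE_URL` and the payload builder are illustrative assumptions — consult the project's API docs for the exact paths and fields:

```python
import json

# Assumed base URL for a local ai00_server instance (default port from the
# Quick Start). The exact API route is an assumption, not the documented path.
BASE_URL = "http://localhost:65530/api/oai/chat/completions"

def build_chat_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat-completion payload."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 1.0,
        "stream": False,
    }

if __name__ == "__main__":
    payload = build_chat_request("What is RWKV?")
    print(json.dumps(payload, indent=2))
```

With the server running, the payload can be POSTed with any HTTP client (e.g. `requests.post(BASE_URL, json=payload)`), just as with the OpenAI API.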

How It Works

The server leverages the web-rwkv inference engine and utilizes Vulkan for parallel and concurrent batched inference. This approach allows it to run on any GPU supporting Vulkan, including AMD and integrated graphics, democratizing GPU acceleration beyond NVIDIA hardware. Its Rust implementation contributes to its compact size and performance.

Quick Start & Requirements

  • Install: Download pre-built executables from the Releases page.
  • Prerequisites: Rust (for building from source); Python with torch and safetensors (for model conversion). Models must be in Safetensors format (.st extension).
  • Setup: Download model, place in assets/models/, modify assets/configs/Config.toml, and run ./ai00_rwkv_server. Access WebUI at http://localhost:65530.
  • Docs: API Docs
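The setup steps above can be pictured with a hypothetical Config.toml fragment. The field names here are illustrative, not the server's actual schema — use the shipped assets/configs/Config.toml as the source of truth:

```toml
# Illustrative sketch only — mirror the structure of the shipped Config.toml.
[model]
path = "assets/models/your-rwkv-model.st"  # example filename; must be .st

[listen]
port = 65530  # WebUI/API port mentioned in the Quick Start
```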

Highlighted Details

  • Vulkan-accelerated inference on non-NVIDIA GPUs.
  • OpenAI API compatibility for seamless integration.
  • Supports RWKV models V5, V6, and V7.
  • Unique BNF sampling for structured output generation.
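BNF sampling constrains decoding so that generated text must match a grammar. A minimal sketch of the idea (grammar syntax illustrative — the project's docs define the exact dialect it accepts):

```
<answer> ::= "yes" | "no"
<reply>  ::= "Answer: " <answer>
```

During sampling, tokens that cannot extend a valid derivation of `<reply>` are masked out, so the model can only emit one of the allowed strings.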

Maintenance & Community

The project is actively maintained with a growing community. Users can join via Discord or a QQ group (30920262).

Licensing & Compatibility

Licensed under MIT/Apache-2.0, allowing for commercial use and integration with closed-source projects.

Limitations & Caveats

Currently only Safetensors models (.st) are supported; .pth checkpoints must be converted first. Hot loading/switching of LoRA models is a planned feature.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 21 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Nat Friedman (former CEO of GitHub), and 32 more.

llama.cpp by ggml-org

C/C++ library for local LLM inference
Top 0.4% · 84k stars
created 2 years ago · updated 20 hours ago