xinfer by guoqingbao

Pure Rust LLM inference engine

Created 1 year ago

293 stars

Top 89.9% on SourcePulse

1 Expert Loves This Project

Lysandre Debut

Chief Open-Source Officer at Hugging Face

Project Summary

xInfer is an LLM inference engine written in pure Rust, with no Python or PyTorch dependency. It targets engineers and power users who need efficient, portable, production-ready LLM inference, and it aims for fast inference with a small footprint and broad hardware compatibility.

How It Works

Its core is a pure Rust backend that avoids Python and PyTorch to improve performance and reduce complexity. xInfer uses native optimizations including Flash Attention, FlashInfer, CUDA Graphs, continuous batching, and prefix caching. KV-cache compression (TurboQuant, 2-4 bit) extends context length by up to 4.3x with minimal quality loss, which lets large models run on consumer GPUs. The result is a small footprint and cross-platform support for CUDA (Linux/Windows) and Metal (macOS) through a single binary and API.
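To give a feel for why low-bit KV-cache storage stretches context length, here is a minimal Rust sketch of plain group-wise uniform quantization. It is not TurboQuant and not xInfer's code: the function names (quantize_group, dequantize_group, compression_ratio), the 16-bit baseline, and the per-group 16-bit scale plus 16-bit minimum are all assumptions made for illustration.

```rust
/// Quantize one group of values to `bits`-bit unsigned codes (bits <= 8)
/// with a shared scale and minimum. Generic asymmetric uniform quantization,
/// used here only as an illustrative stand-in for a real KV-compression scheme.
fn quantize_group(values: &[f32], bits: u32) -> (Vec<u8>, f32, f32) {
    let levels = ((1u32 << bits) - 1) as f32;
    let min = values.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = values.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    // Avoid a zero scale when every value in the group is identical.
    let scale = if max > min { (max - min) / levels } else { 1.0 };
    let codes: Vec<u8> = values
        .iter()
        .map(|v| ((v - min) / scale).round().clamp(0.0, levels) as u8)
        .collect();
    (codes, scale, min)
}

/// Reconstruct approximate values from the codes of one group.
fn dequantize_group(codes: &[u8], scale: f32, min: f32) -> Vec<f32> {
    codes.iter().map(|&c| c as f32 * scale + min).collect()
}

/// Storage ratio relative to a 16-bit cache, assuming each group stores
/// `bits` per value plus 32 bits of metadata (16-bit scale, 16-bit minimum).
fn compression_ratio(bits: u32, group_size: u32) -> f32 {
    let effective_bits = bits as f32 + 32.0 / group_size as f32;
    16.0 / effective_bits
}
```

Under these assumptions the cache shrinks by a factor of 16 / (bits + 32 / group_size), so fewer bits per value leave room for proportionally more cached tokens in the same memory. The ratio a real scheme reaches depends on its own metadata layout and may differ from this estimate.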

Quick Start & Requirements

Install with the shell script (curl -sSL https://guoqingbao.github.io/xinfer/install.sh | bash) or via npm (npm install -g xinfer-ai). The xinfer CLI runs models from Hugging Face model IDs or local paths, and the --ui-server option starts a built-in ChatGPT-style Web UI. Python usage is also supported (python3 -m xinfer.server). A compatible GPU is required: NVIDIA (CUDA) or Apple Silicon (Metal). Building from source needs a Rust compiler and, depending on the target, the CUDA Toolkit or the Xcode command-line tools.

Explore Similar Projects

vllm-swift by TheTom

ntransformer by xaskasdf

ScaleLLM by vectorch-ai

usls by jamjamjon

rvllm by m0at

ssd by tanishqkumar

atlas by Avarok-Cybersecurity

vLLM-Kunlun by baidu

picolm by RightNow-AI

lmdeploy by InternLM

colibri by JustVugg

nano-vllm by GeeeekExplorer