FlashRT by flashrt-project

High-performance realtime inference engine for AI workloads

Created 3 months ago

442 stars

Top 67.0% on SourcePulse

Summary

FlashRT is an inference engine for latency-sensitive, small-batch AI workloads, targeting vision-language-action (VLA) control and LLM inference. It achieves low latency through hand-tuned kernels and static graph composition, and runs on NVIDIA hardware from edge to server.

How It Works

FlashRT avoids ONNX export and engine compilation by composing pre-written CUDA kernels, which are not tied to a single GPU model, into static graphs. It uses hand-tuned kernels for core operations (norm, activation, fusion, FP8/NVFP4 GEMM, attention) and includes a vendored copy of Flash-Attention 2. Quantization is static FP8/NVFP4 with auto-calibration. The entire forward pass is captured as a CUDA Graph, so inference replay runs with no Python overhead; the project describes the resulting latency as sub-millisecond.
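The idea behind static quantization with calibration can be sketched in a few lines of plain Python. This is a minimal illustration, not FlashRT's implementation: the function names and the max-magnitude calibration rule are assumptions made for the example, and mantissa rounding is deliberately left out. The only external fact used is that the largest finite magnitude in the FP8 E4M3 format is 448.

```python
# Illustrative only: names and the calibration rule are assumptions, not FlashRT's API.
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def calibrate_scale(calibration_batches):
    """Return a fixed per-tensor scale from the largest magnitude seen during calibration."""
    amax = max(abs(v) for batch in calibration_batches for v in batch)
    # Guard against an all-zero calibration set.
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def quantize(values, scale):
    """Scale into the FP8 range and clamp; mantissa rounding is not modelled here."""
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]


def dequantize(qvalues, scale):
    """Map scaled values back to the original range."""
    return [q * scale for q in qvalues]


batches = [[0.5, -2.0, 1.25], [3.5, -0.75, 0.1]]
scale = calibrate_scale(batches)        # computed once, then frozen for inference
q = quantize([7.0, -1.75, 0.0], scale)  # 7.0 lies outside the calibrated range and saturates
restored = dequantize(q, scale)
```

Because the scale is fixed ahead of time, no per-call statistics are needed at inference, which is what lets the quantized forward pass sit inside a captured graph. The cost is that inputs outside the calibrated range saturate, as the first value does here.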

Quick Start & Requirements

Installation: Pre-built Docker images (ghcr.io/liangsu8899/flashrt:latest) are recommended. A native Linux build requires cloning the repository, installing dependencies (a PyTorch build matching the installed CUDA version, cmake, transformers<4.56), and running cmake .. && make -j.
Prerequisites: NVIDIA GPU (SM80+ recommended), compatible NVIDIA driver (545+ for CUDA 13), CUDA Toolkit (12.4+), Python 3.10-3.12. The JAX frontend additionally depends on specific pinned package versions.
Setup Time: ~6 minutes for native build from git clone.
Links: Docker README (docker/README.md), Native Install Guide (docs/INSTALL.md), API Examples (examples/).
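Before attempting a native build, the version bounds listed above can be checked with a short script. This is a sketch under stated assumptions: the helper names are hypothetical and not part of FlashRT, and the CUDA version string is assumed to be supplied by the user (for example, read manually from the toolkit's version output).

```python
import sys

# Bounds taken from the prerequisites above; the helper names are hypothetical.
MIN_PYTHON, MAX_PYTHON = (3, 10), (3, 12)
MIN_CUDA = (12, 4)


def python_ok(version=None):
    """Check that a (major, minor, ...) version falls in the supported Python range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_PYTHON <= major_minor <= MAX_PYTHON


def cuda_ok(version_string):
    """Accept a CUDA Toolkit version string such as '12.4' or '13.0'."""
    parts = tuple(int(p) for p in version_string.split(".")[:2])
    return parts >= MIN_CUDA
```

Calling python_ok() with no argument checks the running interpreter. The script does not check the driver version or GPU compute capability.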

Highlighted Details

LLM Inference: Qwen3.6-27B NVFP4 with 256K context on RTX 5090 (~145 tok/s), OpenAI-compatible server.
VLA Control: Production-validated for Pi0, Pi0.5, GR00T N1.6, Pi0-FAST. Pi0.5 achieves 17.58 ms latency on RTX 5090.
Performance: ~16-18x speedup on Jetson AGX Thor for Pi0.5 vs OpenPI baseline.
Quantization: Production FP8/NVFP4 with auto-calibration; optional NVFP4 encoder FFN for Pi0.5.
Zero-Overhead Inference: Static CUDA Graph capture for replay.
Hardware Agnostic: Unified code path across Jetson Thor, RTX 4090, RTX 5090.
Framework Support: PyTorch and JAX frontends share a single kernel binary.
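To put the headline figures in more familiar units, latency and throughput can be converted with simple arithmetic. This is a back-of-the-envelope sketch: it assumes inferences run back to back with no sensor, preprocessing, or actuation time, so the rates are upper bounds rather than measured results. The function names are made up for the example.

```python
def control_rate_hz(latency_ms):
    """Upper bound on control-loop frequency for a given per-inference latency."""
    return 1000.0 / latency_ms


def ms_per_token(tokens_per_second):
    """Average time per generated token for a given decode throughput."""
    return 1000.0 / tokens_per_second


pi05_hz = control_rate_hz(17.58)  # Pi0.5 latency on RTX 5090, from the list above
qwen_ms = ms_per_token(145.0)     # approximate Qwen3.6-27B decode rate, from the list above
print(f"{pi05_hz:.1f} Hz, {qwen_ms:.2f} ms/token")  # prints "56.9 Hz, 6.90 ms/token"
```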

Maintenance & Community

The project is described as a "solo project"; community contributions focus on benchmarks and bug reports. No explicit community channels are listed.

Licensing & Compatibility

The README does not specify a software license, posing a significant adoption risk for commercial use.

Limitations & Caveats

FlashRT is optimized for NVIDIA hardware and depends on specific driver and CUDA versions. Some components are in "research preview" or "beta." Native Linux builds require careful environment management. The lack of a stated license is a critical barrier to commercial adoption.
