FlashRT  by LiangSu8899

High-performance realtime inference engine for AI workloads

Created 1 month ago
325 stars

Top 83.8% on SourcePulse

GitHubView on GitHub
Project Summary

Summary

FlashRT is a high-performance inference engine for latency-sensitive, small-batch AI workloads, targeting VLA control and LLM inference. It delivers low latency via hand-tuned kernels and static graph composition, enabling rapid AI execution across diverse NVIDIA hardware from edge to server.

How It Works

FlashRT avoids ONNX export and engine compilation by composing pre-written, hardware-agnostic CUDA kernels into static graphs. It leverages hand-tuned kernels for core operations (norm, activation, fusion, FP8/NVFP4 GEMM, attention) and includes vendored Flash-Attention 2. The engine uses static FP8/NVFP4 quantization with auto-calibration and captures the entire forward pass as a CUDA Graph, enabling zero Python overhead during inference replay for sub-millisecond latency.

Quick Start & Requirements

  • Installation: Pre-built Docker images (ghcr.io/liangsu8899/flashrt:latest) are recommended. Native Linux builds require cloning, installing dependencies (PyTorch matching CUDA, cmake, transformers<4.56), and running cmake .. && make -j.
  • Prerequisites: NVIDIA GPU (SM80+ recommended), compatible NVIDIA driver (545+ for CUDA 13), CUDA Toolkit (12.4+), Python 3.10-3.12. Specific pinned versions are critical for JAX.
  • Setup Time: ~6 minutes for native build from git clone.
  • Links: Docker README (docker/README.md), Native Install Guide (docs/INSTALL.md), API Examples (examples/).

Highlighted Details

  • LLM Inference: Qwen3.6-27B NVFP4 with 256K context on RTX 5090 (~145 tok/s), OpenAI-compatible server.
  • VLA Control: Production-validated for Pi0, Pi0.5, GROOT N1.6, Pi0-FAST. Pi0.5 achieves 17.58 ms on RTX 5090.
  • Performance: ~16-18x speedup on Jetson AGX Thor for Pi0.5 vs OpenPI baseline.
  • Quantization: Production FP8/NVFP4 with auto-calibration; optional NVFP4 encoder FFN for Pi0.5.
  • Zero-Overhead Inference: Static CUDA Graph capture for replay.
  • Hardware Agnostic: Unified code path across Jetson Thor, RTX 4090, RTX 5090.
  • Framework Support: PyTorch and JAX frontends share a single kernel binary.

Maintenance & Community

Described as a "solo project," with community contributions focused on benchmarks and bug reports. No explicit community channels are listed.

Licensing & Compatibility

The README does not specify a software license, posing a significant adoption risk for commercial use.

Limitations & Caveats

Optimized for NVIDIA hardware with specific driver/CUDA versions. Some components are in "research preview" or "beta." Native Linux builds require careful environment management. The lack of a stated license is a critical barrier.

Health Check
Last Commit

1 day ago

Responsiveness

Inactive

Pull Requests (30d)
47
Issues (30d)
10
Star History
210 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy Andrej Karpathy(Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Clement Delangue Clement Delangue(Cofounder of Hugging Face), and
62 more.

vllm by vllm-project

0.5%
82k
LLM serving engine for high-throughput, memory-efficient inference
Created 3 years ago
Updated 12 hours ago
Feedback? Help us improve.