LiangSu8899
High-performance realtime inference engine for AI workloads
Top 83.8% on SourcePulse
Summary
FlashRT is a high-performance inference engine for latency-sensitive, small-batch AI workloads, targeting VLA control and LLM inference. It achieves low latency through hand-tuned kernels and static graph composition, and runs across NVIDIA hardware from edge devices to servers.
How It Works
FlashRT avoids ONNX export and engine compilation by composing pre-written, hardware-agnostic CUDA kernels into static graphs. It leverages hand-tuned kernels for core operations (norm, activation, fusion, FP8/NVFP4 GEMM, attention) and includes vendored Flash-Attention 2. The engine uses static FP8/NVFP4 quantization with auto-calibration and captures the entire forward pass as a CUDA Graph, enabling zero Python overhead during inference replay for sub-millisecond latency.
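To make the static quantization step concrete, here is a minimal pure-Python sketch of per-tensor FP8 (E4M3) calibration. It is not FlashRT's implementation: the function names (round_to_e4m3, calibrate_scale, fake_quantize) are hypothetical, and the use of a simple max-absolute-value calibration over sample batches is an assumption. The point is only that the scale is computed once, ahead of time, and then reused unchanged at inference, which is what allows a fixed, replayable graph.

```python
import math

E4M3_MAX = 448.0  # largest finite magnitude in the FP8 E4M3 format


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest value on the E4M3 grid, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    x = max(-E4M3_MAX, min(E4M3_MAX, x))
    _, exp = math.frexp(abs(x))   # abs(x) = m * 2**exp with 0.5 <= m < 1
    e = max(exp - 1, -6)          # unbiased exponent, floored for subnormals
    step = 2.0 ** (e - 3)         # 3 mantissa bits give 8 steps per binade
    return round(x / step) * step  # round() uses round-half-to-even


def calibrate_scale(batches) -> float:
    """Assumed calibration rule: one static scale from the largest magnitude seen."""
    amax = max(abs(v) for batch in batches for v in batch)
    return amax / E4M3_MAX if amax > 0.0 else 1.0


def fake_quantize(values, scale: float):
    """Quantize to E4M3 with a fixed scale, then dequantize back to float."""
    return [round_to_e4m3(v / scale) * scale for v in values]


if __name__ == "__main__":
    calibration_batches = [[0.5, -2.0], [1.0, 4.48]]
    scale = calibrate_scale(calibration_batches)  # computed once, offline
    print(fake_quantize([0.3, -1.7, 4.48], scale))
```

Because the scale never changes after calibration, no per-call statistics need to be gathered on the hot path. The trade-off is that activations larger than anything seen during calibration saturate at the format's maximum.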
Quick Start & Requirements
Prebuilt container images (ghcr.io/liangsu8899/flashrt:latest) are recommended. Native Linux builds require cloning the repository, installing dependencies (a PyTorch build matching the installed CUDA version, cmake, transformers<4.56), and running cmake .. && make -j.
Links: docker/README.md, Native Install Guide (docs/INSTALL.md), API Examples (examples/).
Highlighted Details
Maintenance & Community
Described as a "solo project," with community contributions focused on benchmarks and bug reports. No explicit community channels are listed.
Licensing & Compatibility
The README does not specify a software license, posing a significant adoption risk for commercial use.
Limitations & Caveats
FlashRT targets NVIDIA hardware only and depends on specific driver and CUDA versions. Some components are labeled "research preview" or "beta." Native Linux builds require careful environment management. The lack of a stated license is a critical barrier.