dexmal: Real-time Vision-Language-Action (VLA) inference
This project provides accelerated inference kernels for the Pi0 VLA model, enabling real-time performance for applications that need fast visual understanding and action. It targets researchers and developers in robotics and embodied AI, offering significant latency reductions for complex tasks, demonstrated by a real-world demo that catches a falling pen with sub-200 ms latency.
How It Works
The architecture decomposes VLA computation into a vision encoder, LLM, and action expert, simplifying the entire pipeline to 24 GEMM-like operations. This modular structure is optimized using custom Triton kernels for efficient GPU execution, achieving high inference frequencies.
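A rough sketch of this three-stage decomposition, using plain numpy matrix multiplies in place of the fused Triton kernels. All names and dimensions here are hypothetical assumptions for illustration; the real pipeline runs roughly 24 GEMM-like operations on the GPU.

```python
import numpy as np

rng = np.random.default_rng(0)

def gemm(x, w):
    # One "GEMM-like op": matmul plus a cheap elementwise activation.
    # The real pipeline is ~24 of these, implemented as custom Triton kernels.
    return np.maximum(x @ w, 0.0)

# Hypothetical dimensions, chosen only for illustration.
D_PATCH, D_LLM, D_ACT, HORIZON = 768, 2048, 32, 50

# Stage 1: vision encoder -- image patches -> visual tokens.
patches = rng.standard_normal((256, D_PATCH))
vis_tokens = gemm(patches, rng.standard_normal((D_PATCH, D_LLM)))

# Stage 2: LLM backbone -- visual tokens -> context features.
context = gemm(vis_tokens, rng.standard_normal((D_LLM, D_LLM)))

# Stage 3: action expert -- denoise input noise into an action chunk,
# conditioned on the pooled context.
noise = rng.standard_normal((HORIZON, D_LLM))
actions = gemm(noise + context.mean(axis=0),
               rng.standard_normal((D_LLM, D_ACT)))

print(actions.shape)  # (50, 32): an action chunk over the horizon
```

Because every stage reduces to dense matrix multiplies, the whole pipeline maps naturally onto a small, fixed set of GPU kernels, which is what makes the high inference frequencies possible.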
Quick Start & Requirements
Copy pi0_infer.py into your project and use convert_from_jax.py to load checkpoints. Example Python API: infer.forward(normalized_observation_image_bfloat16, observation_state_bfloat16, diffusion_input_noise_bfloat16). Dependencies: torch, triton. Reference: arXiv:2510.26742.
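A minimal sketch of the call pattern above, using a stand-in class since the real object from pi0_infer.py requires the repository and a GPU. Every shape, dtype, and class name below, other than the forward argument order shown in the README, is an assumption (the real API expects bfloat16 tensors; float32 arrays stand in here).

```python
import numpy as np

class StubPi0Infer:
    """Hypothetical stand-in for the inference object from pi0_infer.py."""
    ACTION_HORIZON, ACTION_DIM = 50, 32  # assumed action-chunk shape

    def forward(self, image, state, noise):
        # The real implementation runs the ~24 GEMM-like Triton kernels;
        # this stub only returns a correctly-shaped action chunk.
        assert image.ndim == 3 and state.ndim == 1 and noise.ndim == 2
        return np.zeros((self.ACTION_HORIZON, self.ACTION_DIM),
                        dtype=np.float32)

infer = StubPi0Infer()

# Mirrors the documented call signature (shapes are assumptions):
normalized_observation_image = np.zeros((224, 224, 3), dtype=np.float32)
observation_state = np.zeros((32,), dtype=np.float32)
diffusion_input_noise = np.zeros((50, 32), dtype=np.float32)

actions = infer.forward(normalized_observation_image,
                        observation_state,
                        diffusion_input_noise)
print(actions.shape)  # (50, 32)
```

In the real API the image should already be normalized and all inputs cast to bfloat16 before calling forward.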
Maintenance & Community
No specific details on maintenance, community channels, or contributors are provided in the README snippet.
Licensing & Compatibility
The license type is not explicitly mentioned in the provided README snippet.
Limitations & Caveats
The inference kernels are tuned specifically for the RTX 4090 with CUDA 12.6, though they are expected to function on similar platforms that support torch and triton.
Last updated: 2 months ago · Status: Inactive · Tags: NVIDIA, Physical-Intelligence