StreamVLN by InternRobotics

Vision-and-language navigation for real-time robotic interaction

Created 6 months ago

372 stars

Top 76.1% on SourcePulse

Project Summary


StreamVLN enables real-time, multi-turn Vision-and-Language Navigation (VLN) from continuous video input. Extending LLaVA-Video, it models interleaved vision, language, and actions with efficient context handling for long sequences and online interaction. This project targets embodied AI and robotics researchers, offering a foundation for advanced autonomous navigation.

How It Works

Built on LLaVA-Video, StreamVLN uses "SlowFast context modeling," which splits the context into two parts. The fast part is a streaming dialogue context held in a sliding-window KV cache, so recent turns are available for immediate responses. The slow part is a memory that is updated less often and compressed by token pruning, so long-term context stays affordable. Together they trade compute against how much of the environment history the model can still see while navigating.
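The two-part context described above can be illustrated with a toy sketch. This is not StreamVLN's implementation: the class name `SlowFastContext`, the parameters `window_size` and `keep_ratio`, and the stride-based pruning rule are all hypothetical stand-ins, and plain strings take the place of KV-cache entries and visual tokens. The summary does not say which pruning criterion the project uses, so the rule here is only a placeholder.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List


@dataclass
class SlowFastContext:
    """Toy model: a fast sliding window of recent turns plus a pruned slow memory."""

    window_size: int = 4      # hypothetical: number of turns kept in the fast window
    keep_ratio: float = 0.5   # hypothetical: fraction of tokens kept when pruning
    fast: Deque[List[str]] = field(default_factory=deque)
    slow: List[str] = field(default_factory=list)

    def _prune(self, tokens: List[str]) -> List[str]:
        # Placeholder rule: keep evenly strided tokens. A real system would
        # use a learned or geometry-aware criterion instead.
        if not tokens:
            return []
        keep = max(1, int(len(tokens) * self.keep_ratio))
        stride = len(tokens) / keep
        return [tokens[int(i * stride)] for i in range(keep)]

    def add_turn(self, tokens: List[str]) -> None:
        # New turns enter the fast window unpruned.
        self.fast.append(list(tokens))
        # Turns that slide out of the window are pruned into slow memory.
        while len(self.fast) > self.window_size:
            evicted = self.fast.popleft()
            self.slow.extend(self._prune(evicted))

    def context(self) -> List[str]:
        # The model would attend over slow memory followed by the recent window.
        recent = [tok for turn in self.fast for tok in turn]
        return self.slow + recent


ctx = SlowFastContext(window_size=2, keep_ratio=0.5)
ctx.add_turn(["a0", "a1", "a2", "a3"])
ctx.add_turn(["b0", "b1"])
ctx.add_turn(["c0", "c1"])  # evicts the first turn into slow memory
```

In this sketch the total context grows by only `keep_ratio` of each evicted turn, while the most recent `window_size` turns stay intact.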

Quick Start & Requirements

Environment: Python 3.9, PyTorch 2.1.2, CUDA 12.4.
Installation: Create a conda environment, build habitat-sim (v0.2.4) and habitat-lab (v0.2.4) from source, clone the StreamVLN repo, and install its dependencies.
Data Preparation: Extensive. Requires MP3D/HM3D scenes, VLN-CE episodes (R2R, RxR, EnvDrop, ScaleVLN), trajectory data (hosted on Hugging Face), and co-training datasets (LLaVA-Video-178K, ScanNet, MMC4).
Links: Project Page: https://streamvln.github.io/, arXiv: http://arxiv.org/abs/2507.05240.
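Because the setup pins specific versions, a quick check before data preparation can save time. The sketch below compares the running interpreter and installed packages against the versions listed above. The distribution names in `EXPECTED_PACKAGES` are assumptions (source builds may register under a different name), and the helper names `installed_version` and `check_environment` are hypothetical. CUDA is not checked here.

```python
import sys
from importlib import metadata

EXPECTED_PYTHON = (3, 9)
# Assumed distribution names; adjust if your source builds register differently.
EXPECTED_PACKAGES = {
    "torch": "2.1.2",
    "habitat-sim": "0.2.4",
    "habitat-lab": "0.2.4",
}


def installed_version(dist_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_environment(lookup=installed_version, python=sys.version_info[:2]):
    """Return a list of human-readable mismatches (empty if everything matches)."""
    problems = []
    if tuple(python) != EXPECTED_PYTHON:
        problems.append(f"Python {python[0]}.{python[1]} found, expected 3.9")
    for name, wanted in EXPECTED_PACKAGES.items():
        found = lookup(name)
        if found is None:
            problems.append(f"{name} is not installed (expected {wanted})")
        # Drop local build tags such as "+cu121" before comparing.
        elif found.split("+")[0] != wanted:
            problems.append(f"{name} {found} found, expected {wanted}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["Environment matches the listed versions."]:
        print(line)
```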

Highlighted Details

Performance: The updated checkpoint reports state-of-the-art results on R2R (navigation error NE: 4.90, success weighted by path length SPL: 50.2) and RxR (success rate SR: 54.4, SPL: 45.4), using R2R_VLNCE_v1-3.
Real-World Deployment: Code and guide available for Unitree Go2 robot deployment, featuring enhanced safety and instruction alignment.
Data & Co-training: Code for Dagger data collection has been released, and co-training with LLaVA-Video-178K is supported.
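For readers unfamiliar with the NE, SR, and SPL figures quoted under Performance, the sketch below computes them using their standard definitions from the embodied-navigation literature. It is not taken from the StreamVLN evaluation code. The `Episode` fields and function names are hypothetical, and the sample episodes are made up purely to exercise the functions. Benchmark tables usually report SR and SPL as percentages, so the fractions here would be multiplied by 100.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    success: bool          # did the agent stop close enough to the goal?
    shortest_path: float   # geodesic start-to-goal distance, in meters
    agent_path: float      # length of the path the agent actually took
    final_distance: float  # distance from stop position to goal, in meters


def navigation_error(episodes: List[Episode]) -> float:
    """NE: mean distance to goal at the stop position (lower is better)."""
    return sum(e.final_distance for e in episodes) / len(episodes)


def success_rate(episodes: List[Episode]) -> float:
    """SR: fraction of successful episodes."""
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes: List[Episode]) -> float:
    """SPL: success weighted by shortest_path / max(agent_path, shortest_path)."""
    total = 0.0
    for e in episodes:
        denom = max(e.agent_path, e.shortest_path)
        # Guard against degenerate zero-length episodes.
        efficiency = e.shortest_path / denom if denom > 0 else 1.0
        total += float(e.success) * efficiency
    return total / len(episodes)


# Made-up episodes for illustration only.
sample = [
    Episode(True, 10.0, 10.0, 1.0),   # optimal path
    Episode(True, 10.0, 20.0, 2.0),   # success, but twice the shortest path
    Episode(False, 10.0, 5.0, 6.0),   # stopped early, failed
]
```

SPL never exceeds SR, since each successful episode contributes at most 1.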

Maintenance & Community

Recent activity (Sept 2025) indicates active development. No specific community channels or explicit maintainer details are provided beyond the author list.

Licensing & Compatibility

Licensed under CC BY-NC-SA 4.0. The license restricts use to non-commercial purposes and requires derivative works to be shared under the same terms.

Limitations & Caveats

The CC BY-NC-SA 4.0 license prohibits commercial use. Setup is complex, requiring substantial data preparation and multiple dependencies, including specific versions of habitat-sim and habitat-lab.

Health Check

Last Commit

2 months ago

Responsiveness

Inactive

Star History

36 stars in the last 30 days