mit-han-lab: Real-time Vision-Language Agent deployment and fine-tuning
Summary
VLASH provides an efficient, easy-to-use framework for deploying Vision-Language Agents (VLAs) in real-time, focusing on fast reaction and smooth motion. It targets researchers and engineers needing performant VLA capabilities for robotics and AI, offering optimized inference and simplified fine-tuning on consumer hardware.
How It Works
The core approach combines asynchronous inference with future-state awareness to achieve high reaction speeds and stable operation without overhead. Action quantization further accelerates robot execution. For efficient adaptation, VLASH integrates LoRA with shared observation encoding, enabling fine-tuning on consumer GPUs.
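As a rough illustration of the action-quantization idea mentioned above, here is a minimal sketch using uniform binning over a fixed action range. The binning scheme, range, and bin count are assumptions for illustration; the README does not specify VLASH's exact quantization method.

```python
def quantize(action, lo=-1.0, hi=1.0, bins=256):
    """Map a continuous action to a discrete bin index (uniform binning; assumed scheme)."""
    action = min(max(action, lo), hi)  # clamp to the valid action range
    return round((action - lo) / (hi - lo) * (bins - 1))

def dequantize(index, lo=-1.0, hi=1.0, bins=256):
    """Recover an approximate continuous action from its bin index."""
    return lo + index / (bins - 1) * (hi - lo)

# A quantized action round-trips to within half a bin width
a = 0.37
idx = quantize(a)
assert abs(dequantize(idx) - a) <= (2.0 / 255) / 2
```

Discretizing actions this way lets the policy emit small integer tokens instead of full-precision floats, which is one common way such schemes reduce decoding cost at execution time.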
Quick Start & Requirements
Setup requires Python 3.10 within a Conda environment, ffmpeg 7.1.1 (installed via conda-forge), and an editable install (pip install -e .). It integrates with LeRobot datasets, models, and robots, using YAML for configuration.
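The setup steps above can be sketched as a short shell sequence; the environment name vlash is an assumption, and the final command must run from the repository root.

```shell
# Create and activate a Python 3.10 Conda environment (name "vlash" is assumed)
conda create -n vlash python=3.10 -y
conda activate vlash

# ffmpeg 7.1.1 from conda-forge, as noted in the requirements
conda install -c conda-forge "ffmpeg=7.1.1" -y

# Editable install from the repository root
pip install -e .
```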
Maintenance & Community
Built upon LeRobot and PEFT. No specific community channels or roadmap links are detailed in the README.
Licensing & Compatibility
Released under the Apache 2.0 license, permitting commercial use and modification with standard attribution.
Limitations & Caveats
QLoRA fine-tuning for policies under 8GB GPU memory is listed as a future development item (TODO). Optimization for lower-end GPUs remains a focus.