vla_foundry by TRI-ML

Unified framework for training Vision-Language-Action models

Created 3 months ago
321 stars

Top 84.6% on SourcePulse

Project Summary

VLA Foundry is a unified framework designed for training Vision-Language-Action (VLA) models, enabling seamless progression from Large Language Models (LLMs) to Vision-Language Models (VLMs) and finally to VLAs within a single environment. It targets researchers and engineers working with multi-modal AI, offering a flexible and efficient platform to streamline complex training pipelines without external dependencies. The framework's modular design and support for multi-node training accelerate development and deployment of advanced AI agents.

How It Works

The framework employs a modular, pure-PyTorch architecture that is easy to modify and extend. It supports training across multiple modalities, including text, image-caption pairs, and robotics data, and it integrates with Hugging Face so that pre-trained weights for LLMs, VLMs, and CLIP models can be loaded directly. Distributed training uses FSDP2, with datasets streamed as WebDataset tar shards; multi-node runs are supported on clusters such as AWS SageMaker, and local multi-GPU runs via torchrun. Dataset mixing lets you specify sources and ratios at dataloading time for balanced batching, as sketched below.
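
A minimal sketch of how those pieces typically fit together, assuming PyTorch >= 2.6 (where FSDP2's fully_shard is public) and the webdataset package; the stand-in model, shard URLs, and 80/20 mix ratio are hypothetical, not taken from the repo:

    import os

    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import fully_shard  # FSDP2; public API in PyTorch >= 2.6
    import webdataset as wds

    # torchrun sets LOCAL_RANK, e.g. `torchrun --nproc_per_node=8 this_script.py`
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Stand-in for a pure-PyTorch VLM backbone (the real model comes from the framework).
    model = torch.nn.TransformerEncoder(
        torch.nn.TransformerEncoderLayer(d_model=512, nhead=8), num_layers=4
    ).cuda()
    for layer in model.layers:
        fully_shard(layer)  # shard each block's parameters and gradients across ranks
    fully_shard(model)      # shard whatever parameters remain at the root

    # Stream two tar-shard sources from S3 and mix them 80/20 at the sample level.
    captions = wds.WebDataset("pipe:aws s3 cp s3://bucket/captions-{0000..0099}.tar -")
    robotics = wds.WebDataset("pipe:aws s3 cp s3://bucket/robotics-{0000..0019}.tar -")
    mixed = wds.RandomMix([captions, robotics], probs=[0.8, 0.2])

Sharding each block before the root is the usual FSDP2 pattern: it bounds per-rank memory while letting parameter all-gathers overlap with compute.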

Quick Start & Requirements

Installation is managed with uv. After installing uv, create a Python 3.12 virtual environment, install the locked dependencies with uv sync, then install the package itself with uv pip install -e .; the recommended workflow is uv run <script> <args>. The README includes a quickstart command for training a VLM, which requires AWS credentials for S3 data access and a Hugging Face token.
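
Spelled out, the setup looks like this (the uv commands are standard uv usage; the final training invocation is a hypothetical placeholder, since the actual script name and flags come from the repo's quickstart):

    uv venv --python 3.12   # create the Python 3.12 virtual environment
    uv sync                 # install locked dependencies into it
    uv pip install -e .     # install vla_foundry itself in editable mode
    # hypothetical entry point -- substitute the quickstart command from the README
    uv run scripts/train_vlm.py --config_path configs/vlm.yaml

AWS credentials and the Hugging Face token need to be present in the environment before the training command runs.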

Highlighted Details

  • Supports end-to-end training pipelines for LLM, VLM, and VLA models.
  • Multi-modal data handling (text, image-caption pairs, robotics).
  • Multi-node training leveraging FSDP2 and WebDatasets for efficient data streaming.
  • Modular design with pure PyTorch components and seamless Hugging Face model integration.
  • Flexible dataset mixing and weighting capabilities.
  • Robust argument parsing via draccus, with nested parameters and YAML preset includes (sketched below).
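
Because configuration goes through draccus, configs are nested dataclasses addressable from both YAML and dotted command-line flags. A minimal sketch of that pattern (the class and field names are hypothetical, not the repo's actual schema):

    # train.py -- draccus parses the dataclass from YAML presets and CLI flags
    from dataclasses import dataclass, field

    import draccus

    @dataclass
    class OptimizerConfig:
        lr: float = 1e-4
        weight_decay: float = 0.01

    @dataclass
    class TrainConfig:
        run_name: str = "vlm-baseline"
        optimizer: OptimizerConfig = field(default_factory=OptimizerConfig)

    @draccus.wrap()
    def main(cfg: TrainConfig):
        print(cfg.run_name, cfg.optimizer.lr)

    if __name__ == "__main__":
        main()

Nested fields then take dotted overrides, e.g. python train.py --config_path preset.yaml --optimizer.lr 3e-4.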

Maintenance & Community

The repository includes contribution guidelines (CONTRIBUTING.md) and a troubleshooting FAQ (FAQ.md). Specific community channels (e.g., Discord, Slack) or a public roadmap are not explicitly mentioned in the README.

Licensing & Compatibility

The README does not specify a software license. This absence is a critical factor for adoption, as it leaves the terms of use, distribution, and modification undefined, potentially restricting commercial or closed-source integration.

Limitations & Caveats

YAML configuration includes have a known limitation: overriding nested parameters from an included file is not straightforward and may require redefining all of the parameters or falling back to command-line overrides (illustrated below). Additionally, tests requiring AWS S3 access are not pre-configured; the README recommends small, local datasets for testing.
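
As a concrete illustration of the command-line workaround (field names hypothetical, matching the draccus sketch above): rather than re-declaring an entire nested block in a second YAML file just to change one value, pass the override directly:

    uv run train.py --config_path preset.yaml --optimizer.lr 3e-4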

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 5
  • Issues (30d): 4
  • Star History: 322 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Wing Lian (founder of Axolotl AI), and 10 more.

open_flamingo by mlfoundations (4k stars)
Open-source framework for training large multimodal models
Created 3 years ago · Updated 1 year ago