vla_foundry by TRI-ML

Unified framework for training Vision-Language-Action models

Created 3 months ago
321 stars

Top 84.6% on SourcePulse

Project Summary

VLA Foundry is a unified framework designed for training Vision-Language-Action (VLA) models, enabling seamless progression from Large Language Models (LLMs) to Vision-Language Models (VLMs) and finally to VLAs within a single environment. It targets researchers and engineers working with multi-modal AI, offering a flexible and efficient platform to streamline complex training pipelines without external dependencies. The framework's modular design and support for multi-node training accelerate development and deployment of advanced AI agents.

How It Works

The framework employs a modular, pure-PyTorch architecture that is easy to modify and extend. It supports training across multiple modalities, including text, image-caption pairs, and robotics data, and it integrates with Hugging Face so that pre-trained weights for LLMs, VLMs, and CLIP models can be loaded directly. Distributed training uses FSDP2, with datasets streamed as WebDataset tar shards; multi-node runs are supported on clusters such as AWS SageMaker, and local multi-GPU runs via torchrun. Dataset mixing lets you specify sources and ratios at dataloading time for balanced batching, as sketched below.
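
A minimal sketch of how those pieces typically fit together, assuming PyTorch >= 2.6 (where FSDP2's fully_shard is public) and the webdataset package; the stand-in model, shard URLs, and 80/20 mix ratio are hypothetical, not taken from the repo:

    import os

    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import fully_shard  # FSDP2; public API in PyTorch >= 2.6
    import webdataset as wds

    # torchrun sets LOCAL_RANK, e.g. `torchrun --nproc_per_node=8 this_script.py`
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Stand-in for a pure-PyTorch VLM backbone (the real model comes from the framework).
    model = torch.nn.TransformerEncoder(
        torch.nn.TransformerEncoderLayer(d_model=512, nhead=8), num_layers=4
    ).cuda()
    for layer in model.layers:
        fully_shard(layer)  # shard each block's parameters and gradients across ranks
    fully_shard(model)      # shard whatever parameters remain at the root

    # Stream two tar-shard sources from S3 and mix them 80/20 at the sample level.
    captions = wds.WebDataset("pipe:aws s3 cp s3://bucket/captions-{0000..0099}.tar -")
    robotics = wds.WebDataset("pipe:aws s3 cp s3://bucket/robotics-{0000..0019}.tar -")
    mixed = wds.RandomMix([captions, robotics], probs=[0.8, 0.2])

Sharding each block before the root is the usual FSDP2 pattern: it bounds per-rank memory while letting parameter all-gathers overlap with compute.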

Quick Start & Requirements

Installation is managed with uv. After installing uv, create a Python 3.12 virtual environment, install the locked dependencies with uv sync, then install the package itself with uv pip install -e .; the recommended workflow is uv run <script> <args>. The README includes a quickstart command for training a VLM, which requires AWS credentials for S3 data access and a Hugging Face token.
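
Spelled out, the setup looks like this (the uv commands are standard uv usage; the final training invocation is a hypothetical placeholder, since the actual script name and flags come from the repo's quickstart):

    uv venv --python 3.12   # create the Python 3.12 virtual environment
    uv sync                 # install locked dependencies into it
    uv pip install -e .     # install vla_foundry itself in editable mode
    # hypothetical entry point -- substitute the quickstart command from the README
    uv run scripts/train_vlm.py --config_path configs/vlm.yaml

AWS credentials and the Hugging Face token need to be present in the environment before the training command runs.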

Highlighted Details

  • Supports end-to-end training pipelines for LLM, VLM, and VLA models.
  • Multi-modal data handling (text, image-caption pairs, robotics).
  • Multi-node training leveraging FSDP2 and WebDatasets for efficient data streaming.
  • Modular design with pure PyTorch components and seamless Hugging Face model integration.
  • Flexible dataset mixing and weighting capabilities.
  • Robust argument parsing via draccus, with nested parameters and YAML preset includes (sketched below).
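
Because configuration goes through draccus, configs are nested dataclasses addressable from both YAML and dotted command-line flags. A minimal sketch of that pattern (the class and field names are hypothetical, not the repo's actual schema):

    # train.py -- draccus parses the dataclass from YAML presets and CLI flags
    from dataclasses import dataclass, field

    import draccus

    @dataclass
    class OptimizerConfig:
        lr: float = 1e-4
        weight_decay: float = 0.01

    @dataclass
    class TrainConfig:
        run_name: str = "vlm-baseline"
        optimizer: OptimizerConfig = field(default_factory=OptimizerConfig)

    @draccus.wrap()
    def main(cfg: TrainConfig):
        print(cfg.run_name, cfg.optimizer.lr)

    if __name__ == "__main__":
        main()

Nested fields then take dotted overrides, e.g. python train.py --config_path preset.yaml --optimizer.lr 3e-4.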

Maintenance & Community

The repository includes contribution guidelines (CONTRIBUTING.md) and a troubleshooting FAQ (FAQ.md). Specific community channels (e.g., Discord, Slack) or a public roadmap are not explicitly mentioned in the README.

Licensing & Compatibility

The README does not specify a software license. This absence is a critical factor for adoption, as it leaves the terms of use, distribution, and modification undefined, potentially restricting commercial or closed-source integration.

Limitations & Caveats

YAML configuration includes have a known limitation: overriding nested parameters from an included file is not straightforward and may require redefining all of the parameters or falling back to command-line overrides (illustrated below). Additionally, tests requiring AWS S3 access are not pre-configured; the README recommends small, local datasets for testing.
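
As a concrete illustration of the command-line workaround (field names hypothetical, matching the draccus sketch above): rather than re-declaring an entire nested block in a second YAML file just to change one value, pass the override directly:

    uv run train.py --config_path preset.yaml --optimizer.lr 3e-4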

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 5
  • Issues (30d): 4
  • Star History: 322 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Wing Lian (founder of Axolotl AI), and 10 more.

open_flamingo by mlfoundations (4k stars)
Open-source framework for training large multimodal models
Created 3 years ago · Updated 1 year ago