X-VLA  by 2toinf

Robotic control model using soft-prompted Transformers for cross-embodiment generalization

Created 2 months ago
281 stars

Top 92.9% on SourcePulse

Project Summary

X-VLA addresses the challenge of creating scalable, generalizable Vision-Language-Action (VLA) models for robotic control across diverse embodiments. It offers a unified Transformer architecture with soft prompts, enabling robust deployment in both simulation and real-world systems. This approach benefits researchers and engineers by providing a high-performance, adaptable VLA solution for heterogeneous robotic platforms.

How It Works

The core of X-VLA is a unified Transformer backbone augmented with embodiment-specific soft prompts—learnable embeddings that guide multi-domain policy learning. This design decouples the general policy model from embodiment-specific details, facilitating cross-embodiment generalization. A Server-Client architecture further enhances deployment flexibility, separating the model from environment dependencies and supporting distributed inference. This approach achieves state-of-the-art performance across various robotic platforms.
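The soft-prompt mechanism described above can be sketched as follows. This is an illustrative reconstruction, not X-VLA's actual code: the names (`embodiment_prompts`, `build_input`), the hidden size, and the prompt length are all assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 64      # assumed hidden size, for illustration only
PROMPT_LEN = 4   # assumed number of learnable tokens per embodiment

# One learnable prompt matrix per embodiment; during training these would
# be optimized jointly with the shared Transformer backbone.
embodiment_prompts = {
    "franka_arm": rng.normal(size=(PROMPT_LEN, HIDDEN)),
    "agibot_g1": rng.normal(size=(PROMPT_LEN, HIDDEN)),
}

def build_input(embodiment: str, obs_tokens: np.ndarray) -> np.ndarray:
    """Prepend an embodiment's soft prompt to its observation tokens.

    The shared backbone sees [prompt; tokens], so embodiment-specific
    detail lives in the prompt rather than in the backbone weights.
    """
    prompt = embodiment_prompts[embodiment]
    return np.concatenate([prompt, obs_tokens], axis=0)

obs = rng.normal(size=(10, HIDDEN))  # 10 observation tokens
seq = build_input("franka_arm", obs)
print(seq.shape)  # (14, 64): 4 prompt tokens + 10 observation tokens
```

Because only the prompt dictionary varies per robot, adding a new embodiment amounts to learning one small embedding matrix rather than retraining the backbone.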

Quick Start & Requirements

Installation involves cloning the repository, creating a Python 3.10 Conda environment, and installing dependencies via pip install -r requirements.txt. The project supports inference via a Server-Client architecture, with pre-trained models available on Hugging Face. Links to the paper, project page, and Hugging Face models are provided.
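The installation steps above might look like the following; the repository URL is inferred from the author and project name and may differ, so check the GitHub page before running.

```shell
# Clone the repository (URL assumed from the author/project name)
git clone https://github.com/2toinf/X-VLA.git
cd X-VLA

# Create and activate a Python 3.10 Conda environment
conda create -n xvla python=3.10 -y
conda activate xvla

# Install dependencies
pip install -r requirements.txt
```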

Highlighted Details

  • Championed the AgiBot World Challenge at IROS 2025.
  • Demonstrates state-of-the-art generalization across six simulation and three real-world robotic platforms.
  • Employs a standardized EE6D (End-Effector 6D) control space for consistent action representation.
  • Offers LoRA fine-tuning capabilities with released checkpoints and inference code.
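The EE6D control space mentioned above can be pictured as an action vector with a 3D position component and a 3D orientation component. The sketch below is illustrative only: the exact parameterization X-VLA uses (axis-angle vs. Euler angles, absolute vs. relative targets, gripper handling) is an assumption, not taken from the repository.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class EE6DAction:
    """Illustrative end-effector 6D action: 3D position + 3D orientation.

    The orientation parameterization (axis-angle here) is an assumption.
    """
    position: np.ndarray     # (3,) x, y, z in meters
    orientation: np.ndarray  # (3,) axis-angle rotation, radians

    def to_vector(self) -> np.ndarray:
        # Flatten into the 6D vector a policy head would regress.
        return np.concatenate([self.position, self.orientation])

a = EE6DAction(np.array([0.3, 0.0, 0.2]), np.array([0.0, np.pi, 0.0]))
print(a.to_vector().shape)  # (6,)
```

Standardizing every platform onto one such action layout is what lets a single policy head drive heterogeneous robots.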

Maintenance & Community

The project is maintained by 2toINF. Feedback, issues, and contributions are welcomed via GitHub Discussions and Pull Requests. Specific community channels like Discord or Slack are not detailed in the README.

Licensing & Compatibility

X-VLA is licensed under the Apache License 2.0, permitting free use, modification, and distribution, including for commercial purposes.

Limitations & Caveats

A slight performance drop (around 1%) was noted after converting models to Hugging Face format and is under investigation. Guidance on converting between relative and absolute actions currently requires consulting specific GitHub issues. Evaluation guidance for VLABench is pending updates.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 28
  • Star History: 138 stars in the last 30 days
