RLDX-1 by RLWRLD

Vision-Language-Action model for dexterous manipulation

Created 3 months ago

310 stars

Top 86.7% on SourcePulse

Project Summary

RLDX-1 is a Vision-Language-Action (VLA) model designed for human-like dexterous manipulation in robotics. It targets researchers and engineers who need advanced robotic control, and it adds motion awareness, long-term memory, and physical sensing on top of what standard vision-language models (VLMs) provide. The project provides a unified architecture together with a training and inference pipeline for developing more capable and adaptable robotic agents.

How It Works

RLDX-1 uses a Multi-Stream Action Transformer (MSAT) architecture, an extension of MM-DiT that dedicates separate streams to cognition, physics, and action and couples them through joint self-attention. The design supports three capabilities: motion awareness, through multi-frame observations and a motion module; long-term memory, through a dedicated memory module that fuses historical and current features; and physical sensing, through tactile and torque data integrated into the physics stream. A three-stage training pipeline (pre-training, mid-training, post-training), augmented with synthetic data, improves generalization and task adaptation.
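
To make the idea of "separate streams coupled by joint self-attention" more concrete, here is a minimal pure-Python sketch in the spirit of MM-DiT. It is not RLDX-1 code: the function names, token counts, dimensions, and random weights are all illustrative assumptions. Each stream keeps its own query/key/value projections, the projected tokens are concatenated, a single attention pass runs over the joint sequence, and the result is split back per stream.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rng, rows, cols):
    """Small random matrix; stands in for learned weights or token features."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def joint_self_attention(streams, weights):
    """streams: name -> [n_tokens, dim] matrix; weights: name -> (Wq, Wk, Wv)."""
    q, k, v, sizes = [], [], [], []
    for name, tokens in streams.items():
        wq, wk, wv = weights[name]  # stream-specific projections (hypothetical)
        q += matmul(tokens, wq)
        k += matmul(tokens, wk)
        v += matmul(tokens, wv)
        sizes.append((name, len(tokens)))
    dim = len(q[0])
    # A single attention pass over the concatenated sequence couples all streams.
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    attn = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    mixed = matmul(attn, v)
    # Split the joint output back into per-stream token blocks.
    out, start = {}, 0
    for name, n in sizes:
        out[name] = mixed[start:start + n]
        start += n
    return out, attn

rng = random.Random(0)
DIM = 4  # illustrative feature size
streams = {
    "cognition": random_matrix(rng, 3, DIM),
    "physics": random_matrix(rng, 2, DIM),
    "action": random_matrix(rng, 2, DIM),
}
weights = {name: tuple(random_matrix(rng, DIM, DIM) for _ in range(3)) for name in streams}
outputs, attn = joint_self_attention(streams, weights)
for name, block in outputs.items():
    print(name, len(block), "tokens")
```

Because every row of the attention matrix spans all seven tokens, an action token can attend directly to physics and cognition tokens while each stream still keeps its own parameters.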

Quick Start & Requirements

Installation involves cloning the repository, setting up the Python 3.10 environment with uv, and installing the package.

Primary install:

git clone https://github.com/RLWRLD/RLDX-1.git
cd RLDX-1
uv sync --python 3.10
uv pip install -e .

Prerequisites: Python 3.10, CUDA 12.x, uv v0.8.4+.
Documentation: Comprehensive guides are available in the docs/ directory, covering installation, architecture, training, and inference. Key links include the Paper and Project Page.
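
Before installing, the Python and uv prerequisites listed above can be checked with a short script. This is a sketch rather than part of the project: the helper names are hypothetical, and it assumes that "uv --version" prints a line containing a dotted version number. It does not check CUDA, and the Python check only inspects the interpreter running the script.

```python
import re
import shutil
import subprocess
import sys

MIN_UV = (0, 8, 4)         # from the prerequisites above
REQUIRED_PYTHON = (3, 10)  # from the prerequisites above

def parse_version(text):
    """Extract the first dotted version number from text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return tuple(int(part) for part in match.groups(default="0"))

def uv_is_new_enough(version_output):
    """True if the reported uv version is at least MIN_UV."""
    return parse_version(version_output) >= MIN_UV

def python_matches(version_info=sys.version_info):
    """True if the given interpreter version is Python 3.10.x."""
    return tuple(version_info[:2]) == REQUIRED_PYTHON

if __name__ == "__main__":
    print("Python 3.10:", python_matches())
    uv_path = shutil.which("uv")
    if uv_path is None:
        print("uv: not found on PATH")
    else:
        out = subprocess.run([uv_path, "--version"], capture_output=True, text=True).stdout
        print("uv >= 0.8.4:", uv_is_new_enough(out))
```

Versions are compared as integer tuples rather than strings, so a release such as 0.10.0 is correctly treated as newer than 0.8.4.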

Highlighted Details

Achieves real-time inference speeds of 43.7 ms/step on an RTX 5090 (>22 Hz) using static graph capture and custom fused kernels.
Demonstrates state-of-the-art performance across various benchmarks, including LIBERO (97.8% avg), LIBERO-Plus (86.7%), SIMPLER variants, RoboCasa Kitchen (70.6%), GR-1 Tabletop (58.7%), and RoboCasa365 (32.1%).
The MSAT architecture integrates distinct streams for cognition, physics, and action, allowing for richer state representation and control.
Synthetic data augmentation is utilized to improve performance on rare manipulation scenarios.
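
The ">22 Hz" figure follows directly from the reported step latency. The arithmetic can be checked with a few lines of Python; the function name is illustrative.

```python
def control_rate_hz(latency_ms):
    """Convert per-step latency in milliseconds to steps per second."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms

rate = control_rate_hz(43.7)  # reported RTX 5090 latency
print(f"{rate:.1f} Hz")  # 22.9 Hz
```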

Maintenance & Community

The project states that external pull requests are not accepted. Bug reports and questions should be filed as issues on the GitHub repository. RLDX-1 builds upon NVIDIA GR00T N1.7, Qwen3-VL, and FLUX.

Licensing & Compatibility

The codebase is released under the Apache License 2.0. The model weights, however, are distributed under the RLWRLD Model License v1.0, a non-commercial license with attribution and share-alike requirements. This restricts the pre-trained and mid-trained checkpoints to non-commercial use.

Limitations & Caveats

The non-commercial license for model weights is a significant restriction for potential adopters. Furthermore, the fullgraph inference optimization mode is tuned for RTX 5090 (sm_120) architectures, and users with different GPU architectures may need to rely on the submodule compile mode for optimal results. External contributions to the codebase are not accepted.
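
The GPU caveat above amounts to a choice between two optimization modes based on compute capability. A hedged sketch of that decision follows. It assumes a PyTorch-based setup, which the summary does not confirm, and the helper names are hypothetical; only the mode names "fullgraph" and "submodule" and the sm_120 target come from the text above. How the chosen mode is actually passed to RLDX-1 is not covered here, so consult the docs/ directory for the real option.

```python
def pick_compile_mode(capability):
    """Return 'fullgraph' for sm_120 GPUs and 'submodule' otherwise."""
    major, minor = capability
    return "fullgraph" if (major, minor) == (12, 0) else "submodule"

def detect_capability():
    """Return (major, minor) of GPU 0 via PyTorch, or None if unavailable."""
    try:
        import torch  # assumption: PyTorch is installed in the environment
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)

capability = detect_capability()
if capability is None:
    print("No CUDA GPU detected")
else:
    print(f"sm_{capability[0]}{capability[1]} ->", pick_compile_mode(capability))
```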

Health Check

Last Commit

1 day ago

Responsiveness

Inactive

Star History

24 stars in the last 30 days