RLDX-1  by RLWRLD

Vision-Language-Action model for dexterous manipulation

Created 1 month ago
273 stars

Project Summary

RLDX-1 is a Vision-Language-Action (VLA) model for human-like dexterous manipulation in robotics. It targets researchers and engineers who need robotic control with motion awareness, long-term memory, and physical sensing beyond what standard VLMs offer. The project provides a unified architecture together with a training and inference pipeline for building more capable and adaptable robotic agents.

How It Works

RLDX-1 employs a Multi-Stream Action Transformer (MSAT) architecture, an extension of MM-DiT that dedicates separate streams to cognition, physics, and action and couples them through joint self-attention. This design supports three capabilities: motion awareness, through multi-frame observations and a motion module; long-term memory, through a dedicated memory module that fuses historical and current features; and physical sensing, through tactile and torque data integrated into the physics stream. A synthetic-augmented, three-stage training pipeline (pre-training, mid-training, post-training) improves generalization and task adaptation.
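
The coupling described above can be illustrated with a minimal sketch of MM-DiT-style joint self-attention. It is not the RLDX-1 implementation: the token counts, feature width, random weights, and every function name here are assumptions made for illustration, and the sketch omits multiple heads, normalization, feed-forward layers, and the motion and memory modules. It shows only the core idea that each stream keeps its own projections while attention runs over the concatenation of all streams.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(streams, weights):
    """Single-head joint attention over several token streams.

    streams: {stream_name: [token_vector, ...]}
    weights: {stream_name: {"q": M, "k": M, "v": M}}, one set per stream.
    """
    q, k, v, owner = [], [], [], []
    for name, tokens in streams.items():
        w = weights[name]  # stream-specific projections
        for token in tokens:
            q.append(matvec(w["q"], token))
            k.append(matvec(w["k"], token))
            v.append(matvec(w["v"], token))
            owner.append(name)

    dim = len(q[0])
    scale = 1.0 / math.sqrt(dim)
    out = {name: [] for name in streams}
    for query, name in zip(q, owner):
        # Every query attends over the keys of ALL streams (the joint part).
        attn = softmax([scale * sum(a * b for a, b in zip(query, key)) for key in k])
        mixed = [sum(p * value[c] for p, value in zip(attn, v)) for c in range(dim)]
        out[name].append(mixed)  # route the result back to its own stream
    return out


# Hypothetical sizes, chosen only to keep the example small.
DIM = 4
TOKENS = {"cognition": 3, "physics": 2, "action": 5}
rng = random.Random(0)
streams = {
    name: [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(count)]
    for name, count in TOKENS.items()
}
weights = {
    name: {
        proj: [[rng.gauss(0.0, 0.5) for _ in range(DIM)] for _ in range(DIM)]
        for proj in ("q", "k", "v")
    }
    for name in TOKENS
}
out = joint_self_attention(streams, weights)
print({name: len(tokens) for name, tokens in out.items()})
```

Each stream returns as many tokens as it received, but every output token mixes information from all three streams, so a change in a tactile or torque token in the physics stream can influence the action tokens.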

Quick Start & Requirements

Installation involves cloning the repository, setting up the Python 3.10 environment with uv, and installing the package.

  • Primary install:
    git clone https://github.com/RLWRLD/RLDX-1.git
    cd RLDX-1
    uv sync --python 3.10
    uv pip install -e .
    
  • Prerequisites: Python 3.10, CUDA 12.x, uv v0.8.4+.
  • Documentation: Guides in the docs/ directory cover installation, architecture, training, and inference. The repository also links to the paper and the project page.
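
Before running the install commands, it can help to confirm the listed prerequisites. The following is a minimal sketch, not part of the RLDX-1 repository: the function names are hypothetical, the Python check only inspects the interpreter that runs the script, and CUDA is not checked at all. It relies on `uv --version` printing a string that contains an X.Y.Z version number.

```python
import re
import shutil
import subprocess
import sys

REQUIRED_PYTHON = (3, 10)  # from the prerequisites list
MIN_UV = (0, 8, 4)         # from the prerequisites list


def parse_version(text):
    """Extract the first X.Y.Z triple from a string such as 'uv 0.8.4'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def uv_is_new_enough(version_output, minimum=MIN_UV):
    """Compare the parsed uv version against the minimum, component-wise."""
    return parse_version(version_output) >= minimum


def check_prerequisites():
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if sys.version_info[:2] != REQUIRED_PYTHON:
        found = ".".join(str(n) for n in sys.version_info[:2])
        problems.append(f"this interpreter is Python {found}, expected 3.10")

    uv_path = shutil.which("uv")
    if uv_path is None:
        problems.append("uv was not found on PATH")
    else:
        result = subprocess.run(
            [uv_path, "--version"], capture_output=True, text=True, check=False
        )
        try:
            if not uv_is_new_enough(result.stdout):
                problems.append(f"uv is older than 0.8.4: {result.stdout.strip()}")
        except ValueError:
            problems.append("could not parse the output of 'uv --version'")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

Because `uv sync --python 3.10` selects the interpreter for the project environment itself, a warning about the Python version matters mainly when the script is run from inside that environment.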

Highlighted Details

  • Runs real-time inference at 43.7 ms/step (>22 Hz) on an RTX 5090, using static graph capture and custom fused kernels.
  • Reports state-of-the-art results on several benchmarks, including LIBERO (97.8% avg), LIBERO-Plus (86.7%), SIMPLER variants, RoboCasa Kitchen (70.6%), GR-1 Tabletop (58.7%), and RoboCasa365 (32.1%).
  • The MSAT architecture integrates distinct streams for cognition, physics, and action, allowing for richer state representation and control.
  • Synthetic data augmentation improves performance on rare manipulation scenarios.
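
The relationship between the per-step latency and the control rate quoted above is a simple reciprocal. The short sketch below works through the arithmetic; the helper names are illustrative only.

```python
def control_rate_hz(latency_ms):
    """Convert a per-step latency in milliseconds to steps per second."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


def latency_budget_ms(rate_hz):
    """Largest per-step latency that still sustains the given rate."""
    if rate_hz <= 0:
        raise ValueError("rate must be positive")
    return 1000.0 / rate_hz


print(f"{control_rate_hz(43.7):.2f} Hz")   # 22.88 Hz
print(f"{latency_budget_ms(22):.2f} ms")   # 45.45 ms
```

A step time of 43.7 ms therefore leaves under 2 ms of headroom against a 22 Hz control loop.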

Maintenance & Community

The project explicitly states that external pull requests are not accepted. Users who encounter bugs or have questions should open an issue on the GitHub repository. The project builds upon NVIDIA GR00T N1.7, Qwen3-VL, and FLUX.

Licensing & Compatibility

The codebase is released under the Apache License 2.0. The model weights, however, are distributed under the RLWRLD Model License v1.0, a non-commercial license with attribution and share-alike requirements. This restricts the pre-trained and mid-trained checkpoints to non-commercial use.

Limitations & Caveats

The non-commercial license for model weights is a significant restriction for potential adopters. In addition, the fullgraph inference optimization mode is tuned for the RTX 5090 (sm_120), so users on other GPU architectures may need to fall back to the submodule compile mode for best results. External contributions to the codebase are not accepted.
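
The GPU caveat above can be turned into a small selection helper. This is a hedged sketch: the mode names "fullgraph" and "submodule" come from the text, but how RLDX-1 actually accepts a mode is not described here, so the functions below are hypothetical and only decide which name to use. Detection uses `torch.cuda.get_device_capability()` when PyTorch and a CUDA device are available, and returns `None` otherwise.

```python
FULLGRAPH_TUNED_CAPABILITY = (12, 0)  # sm_120, i.e. the RTX 5090


def sm_name(capability):
    """Format a (major, minor) compute capability as an sm_XY string."""
    major, minor = capability
    return f"sm_{major}{minor}"


def choose_compile_mode(capability):
    """Pick 'fullgraph' on the tuned architecture, 'submodule' elsewhere."""
    if tuple(capability) == FULLGRAPH_TUNED_CAPABILITY:
        return "fullgraph"
    return "submodule"


def detect_capability():
    """Return the current GPU's (major, minor), or None if unavailable."""
    try:
        import torch  # optional dependency for this sketch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()


if __name__ == "__main__":
    capability = detect_capability()
    if capability is None:
        print("No CUDA device detected; cannot recommend a compile mode.")
    else:
        print(f"{sm_name(capability)}: use {choose_compile_mode(capability)} mode")
```

Treating every architecture other than sm_120 as a submodule case is a conservative assumption; other GPUs may also work with fullgraph mode, only without the tuning described above.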

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 1
  • Star history: 112 stars in the last 30 days
