Vision-language-action framework for dexterous grasping
DexGraspVLA is a vision-language-action framework designed for general dexterous grasping in complex, real-world scenarios. It targets researchers and engineers in robotics and AI, offering a robust solution for zero-shot grasping with high success rates, even with unseen objects and under challenging conditions. The framework excels at long-horizon tasks requiring complex reasoning, human disturbance handling, and failure recovery.
How It Works
DexGraspVLA employs a hierarchical approach. A pre-trained vision-language model (Qwen2.5-VL-72B-Instruct) acts as the high-level task planner, interpreting natural language commands and scene context. A diffusion-based policy serves as the low-level action controller, learning dexterous grasping movements from demonstrations. This pairing combines the generalization of a foundation model with the precise control of a diffusion policy trained by imitation learning.
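To make the division of labor concrete, here is a minimal, self-contained sketch of such a planner/controller loop. Every name in it (DummyPlanner, DummyDiffusionController, grasp_loop, the placeholder action vector) is an illustrative assumption rather than DexGraspVLA's actual code; in the real system the dummies are replaced by Qwen2.5-VL-72B-Instruct and the trained diffusion policy.

```python
# Conceptual sketch only: none of these names correspond to DexGraspVLA's code.
# It illustrates the hierarchy: a VLM planner picks *what* to grasp at a low
# rate, while a diffusion-policy controller decides *how* to move at every step.
import random
from dataclasses import dataclass


@dataclass
class Subgoal:
    target: str          # object the planner wants grasped next
    done: bool = False   # planner's judgment that the task is complete


class DummyPlanner:
    """Stand-in for the high-level planner. The real system prompts a
    vision-language model (Qwen2.5-VL-72B-Instruct) with the scene image and
    the user instruction; here we just pop objects from a fixed list."""

    def __init__(self, objects):
        self.remaining = list(objects)

    def plan(self, image, instruction):
        if not self.remaining:
            return Subgoal(target="", done=True)
        return Subgoal(target=self.remaining.pop(0))


class DummyDiffusionController:
    """Stand-in for the low-level controller. The real system denoises an
    action chunk conditioned on observations and the planner's target; here
    we return a random placeholder action vector."""

    def act(self, observation, subgoal):
        return [random.uniform(-0.05, 0.05) for _ in range(12)]


def grasp_loop(planner, controller, instruction, control_steps_per_plan=10):
    """Outer loop: re-plan at a low rate, act at a high rate."""
    image, observation = None, None   # placeholders for camera / robot state
    subgoal = planner.plan(image, instruction)
    while not subgoal.done:
        for _ in range(control_steps_per_plan):
            action = controller.act(observation, subgoal)
            # A real system would send `action` to the arm and hand here.
        print(f"Attempted grasp of: {subgoal.target}")
        subgoal = planner.plan(image, instruction)  # monitor, recover, or finish


grasp_loop(DummyPlanner(["red cup", "toy block"]),
           DummyDiffusionController(),
           "clear the table")
```

The point the sketch makes is that re-planning runs at a much lower rate than control, which is what lets the planner monitor progress, react to disturbances, and trigger recovery while the controller handles moment-to-moment motion.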
Quick Start & Requirements
Create and activate a conda environment (conda create -n dexgraspvla python=3.9, conda activate dexgraspvla), then run pip install -r requirements.txt. Install SAM and Cutie following their respective instructions. An example grasp demonstration (grasp_demo_example.tar.gz) is provided for understanding the data format and training.
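For the data-format step, a short inspection snippet like the following can help. Only the archive name grasp_demo_example.tar.gz comes from the README; the output directory and the number of entries listed are arbitrary choices.

```python
# Peek inside the example demonstration archive to see how it is organized
# before wiring it into training.
import tarfile
from pathlib import Path

archive = Path("grasp_demo_example.tar.gz")
out_dir = Path("grasp_demo_example")

with tarfile.open(archive, "r:gz") as tar:
    for member in tar.getmembers()[:20]:   # list the first few entries
        print(member.name, member.size)
    tar.extractall(out_dir)                # extract for a closer look
```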
Highlighted Details
A pre-trained controller checkpoint (dexgraspvla-controller-20250320) is released for immediate use.
Maintenance & Community
The project is associated with authors from institutions including Tsinghua University. Links to community channels are not explicitly provided in the README.
Licensing & Compatibility
The repository's license is not specified in the README. The hardware-related code is withheld due to intellectual property constraints.
Limitations & Caveats
The README explicitly states that hardware-related code is not open-sourced due to IP constraints, which may prevent full replication of the inference setup. The planner additionally relies on a large, potentially costly foundation model.