apple: Egocentric video dataset for dexterous manipulation learning
Summary
EgoDex provides a large-scale dataset and benchmark for learning dexterous manipulation from egocentric video. Aimed at researchers and engineers in robotics and embodied AI, it enables training and evaluation of models using high-fidelity, first-person perspectives of human manipulation tasks.
How It Works
The dataset comprises 829 hours of 30 Hz, 1080p egocentric video of tabletop manipulation, captured on Apple Vision Pro using ARKit. Each video is paired with 3D pose annotations (SE(3) transforms for the head, body, hands, and fingers) and natural language descriptions generated by LLMs/VLMs. Together, the video, poses, and text give models a first-person view of human manipulation with 3D supervision.
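To give a sense of scale, the duration and frame rate quoted above can be turned into an approximate frame count. This is a back-of-the-envelope sketch based only on those two figures; the function name is illustrative, and the true count may differ slightly depending on how episodes are trimmed.

```python
# Rough frame-count estimate from the figures quoted above (829 hours at 30 Hz).
HOURS = 829
FPS = 30


def total_frames(hours: float, fps: int) -> int:
    """Number of video frames in `hours` of footage at `fps` frames per second."""
    return int(hours * 3600 * fps)


frames = total_frames(HOURS, FPS)
print(f"{frames:,} frames")  # 89,532,000 frames
```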
Quick Start & Requirements
Setup involves creating a Python 3.11 Conda environment, installing ffmpeg (7.1.1), and running pip install -r requirements.txt. The main practical hurdle is the download: the dataset exceeds 1.5 TB, split across multiple large zip files for the training and test sets. Sample scripts (simple_dataset.py, visualize_2d.py, visualize_3d.py, compute_metrics.py) are provided for data loading, visualization, and evaluation.
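Before starting a download of this size, it can help to confirm the prerequisites listed above. The following is a minimal preflight sketch using only the Python standard library; it is not part of the repository. The helper names are hypothetical, the 1.5 TB threshold is treated as a decimal lower bound, and unpacking the zip files will likely need additional room on top of that.

```python
import shutil
import sys

REQUIRED_PYTHON = (3, 11)      # version named in the setup notes
REQUIRED_BYTES = int(1.5e12)   # "over 1.5 TB", assumed here to mean decimal terabytes


def python_ok(version=sys.version_info, required=REQUIRED_PYTHON) -> bool:
    """True when the interpreter's major.minor matches the required version."""
    return tuple(version[:2]) == tuple(required)


def ffmpeg_path():
    """Path of the ffmpeg executable found on PATH, or None if it is missing."""
    return shutil.which("ffmpeg")


def enough_space(free_bytes: int, required_bytes: int = REQUIRED_BYTES) -> bool:
    """True when the free space covers at least the stated download size."""
    return free_bytes >= required_bytes


def preflight(download_dir: str = ".") -> dict:
    """Run all checks against the current interpreter and `download_dir`."""
    free = shutil.disk_usage(download_dir).free
    return {
        "python_ok": python_ok(),
        "ffmpeg_found": ffmpeg_path() is not None,
        "enough_space": enough_space(free),
    }


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{name}: {'ok' if ok else 'CHECK'}")
```

The check only confirms that an ffmpeg binary is present; it does not verify the 7.1.1 version mentioned above.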
Maintenance & Community
The project is associated with the research paper "EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video." Notable community projects leveraging EgoDex include the EgoDex Viewer (Gradio app), H-RDT (Human-to-Robotics Diffusion Transformer), and Being-H0 (VLA model).
Licensing & Compatibility
The code is released under the terms detailed in the repository's LICENSE file. The dataset is licensed under CC BY-NC-ND (Creative Commons Attribution-NonCommercial-NoDerivatives), which restricts it to non-commercial use and prohibits distributing derivative works.
Limitations & Caveats
Natural language annotations generated by LLMs/VLMs may contain inaccuracies. 2D re-projections of 3D pose data can exhibit perspective mismatches with the RGB video. The ARKit origin frame, while consistent within an episode, is not standardized across different recordings, requiring careful handling for cross-episode analysis.
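One common way to handle the per-episode origin is to re-express every pose relative to a reference pose taken from the same episode, so that all episodes share a comparable frame. The sketch below illustrates that idea in pure Python on 4x4 homogeneous SE(3) matrices. It is an assumption about how one might normalize the data, not the repository's method: the function names are hypothetical, the choice of reference (here, a first-frame head pose) is up to the user, and loading poses from the dataset files is not shown.

```python
# Poses are 4x4 homogeneous SE(3) matrices stored as nested lists (row-major).


def invert_se3(T):
    """Invert a rigid transform [R t; 0 1] as [R^T, -R^T t; 0 1]."""
    R_t = [[T[c][r] for c in range(3)] for r in range(3)]  # transpose of the rotation
    t = [T[r][3] for r in range(3)]
    t_inv = [-sum(R_t[r][c] * t[c] for c in range(3)) for r in range(3)]
    return [R_t[r] + [t_inv[r]] for r in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def matmul4(A, B):
    """Product of two 4x4 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def rebase_episode(poses, reference):
    """Express each origin-frame pose relative to `reference`."""
    ref_inv = invert_se3(reference)
    return [matmul4(ref_inv, T) for T in poses]


def translation(x, y, z):
    """Pure-translation pose, used only to build the toy example below."""
    return [[1.0, 0.0, 0.0, x],
            [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]


# Toy episode: head at (1, 2, 3), hand 0.5 m further along x, same orientation.
head_first_frame = translation(1.0, 2.0, 3.0)
hand = translation(1.5, 2.0, 3.0)

rebased = rebase_episode([hand], head_first_frame)
print([row[3] for row in rebased[0][:3]])  # [0.5, 0.0, 0.0]
```

After rebasing, the hand position no longer depends on where the origin happened to be placed for that recording, which makes statistics computed across episodes easier to compare.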