ml-egodex  by apple

Egocentric video dataset for dexterous manipulation learning

Created 9 months ago
264 stars

Top 96.5% on SourcePulse

Summary

EgoDex provides a large-scale dataset and benchmark for learning dexterous manipulation from egocentric video. Aimed at researchers and engineers in robotics and embodied AI, it enables training and evaluation of models using high-fidelity, first-person perspectives of human manipulation tasks.

How It Works

The dataset comprises 829 hours of 30 Hz, 1080p egocentric video of tabletop manipulation, captured via ARKit on Apple Vision Pro. The video is paired with detailed 3D pose annotations (SE(3) transforms for the head, body, hands, and fingers) and natural language descriptions generated by LLMs/VLMs. Together, these give models a first-person view of human manipulation alongside the corresponding motion data.
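To make the pose format concrete, here is a minimal Python sketch of how an SE(3) transform can be stored as a 4x4 homogeneous matrix and applied to a point. The helper names (make_transform, rotation_z, apply_transform) and the numeric pose are illustrative assumptions, not part of the EgoDex code or data. Plain lists are used only to keep the sketch dependency-free; real code would more likely use NumPy arrays.

```python
import math


def make_transform(rotation, translation):
    """Build a 4x4 homogeneous SE(3) matrix from a 3x3 rotation and a 3-vector."""
    return [
        [*rotation[0], translation[0]],
        [*rotation[1], translation[1]],
        [*rotation[2], translation[2]],
        [0.0, 0.0, 0.0, 1.0],
    ]


def rotation_z(angle_rad):
    """Rotation matrix for a right-handed rotation about the z axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def apply_transform(T, point):
    """Map a 3D point from the transform's local frame into its parent frame."""
    x, y, z = point
    return tuple(T[i][0] * x + T[i][1] * y + T[i][2] * z + T[i][3] for i in range(3))


# Hypothetical hand pose in the world frame: rotated 90 degrees about z,
# positioned at (0.3, 0.0, 1.0) metres. These numbers are made up.
hand_in_world = make_transform(rotation_z(math.pi / 2), [0.3, 0.0, 1.0])

# A point 10 cm along the hand's local x axis, expressed in world coordinates.
fingertip_world = apply_transform(hand_in_world, (0.1, 0.0, 0.0))
print(fingertip_world)
```

The top-left 3x3 block carries orientation and the last column carries position, so a single matrix describes where a joint is and which way it faces.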

Quick Start & Requirements

Setup involves creating a Python 3.11 Conda environment, installing ffmpeg (7.1.1), and running pip install -r requirements.txt. The primary challenge is downloading the dataset, which spans over 1.5 TB across multiple large zip files for training and testing. Sample scripts (simple_dataset.py, visualize_2d.py, visualize_3d.py, compute_metrics.py) are provided for data loading, visualization, and evaluation.
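Because the download is so large, it can be worth checking free disk space before starting. The sketch below uses only the Python standard library. The 1.5 TB threshold is the lower bound quoted above, and the has_room helper and the target directory are hypothetical names chosen for illustration.

```python
import shutil

# Lower bound taken from the "over 1.5 TB" figure above; the extracted
# files need additional room on top of the zip archives.
DATASET_BYTES = int(1.5e12)


def has_room(path, required_bytes=DATASET_BYTES):
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes


if __name__ == "__main__":
    target = "."  # hypothetical download directory
    free_tb = shutil.disk_usage(target).free / 1e12
    print(f"{free_tb:.2f} TB free; enough for the archives: {has_room(target)}")
```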

Highlighted Details

  • Massive Dataset: 829 hours of high-resolution egocentric video.
  • Rich Annotations: Precise 3D skeletal pose (SE(3)), joint confidences, and LLM/VLM-generated task descriptions.
  • ARKit Origin Frame: All pose data is relative to a consistent, stationary world frame per recording session.
  • Pedagogical Code: Example scripts demonstrate data access, visualization, and metric calculation.

Maintenance & Community

The project is associated with the research paper "EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video." Notable community projects leveraging EgoDex include the EgoDex Viewer (Gradio app), H-RDT (Human-to-Robotics Diffusion Transformer), and Being-H0 (VLA model).

Licensing & Compatibility

The code is released under the terms in the repository's LICENSE file. The dataset is licensed under CC BY-NC-ND (Creative Commons Attribution-NonCommercial-NoDerivatives), which restricts use to non-commercial purposes and prohibits distributing derivative works.

Limitations & Caveats

The natural language annotations are generated by LLMs/VLMs and may contain inaccuracies. 2D re-projections of the 3D pose data can show perspective mismatches with the RGB video. The ARKit origin frame is consistent within an episode but not standardized across recordings, so poses from different recordings are not directly comparable and cross-episode analysis requires careful handling.
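One common way to handle the non-standardized origin is to re-express every pose relative to a reference pose from the same recording, such as the camera pose in the first frame. The sketch below illustrates that idea. It is an assumption about a reasonable workflow, not the method used by the EgoDex authors, and the function names and example matrices are hypothetical.

```python
def invert_se3(T):
    """Invert a 4x4 rigid transform using R^T and -R^T t."""
    R_t = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    new_t = [-sum(R_t[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [R_t[0] + [new_t[0]], R_t[1] + [new_t[1]], R_t[2] + [new_t[2]],
            [0.0, 0.0, 0.0, 1.0]]


def matmul4(A, B):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def to_reference_frame(reference_in_world, pose_in_world):
    """Express pose_in_world relative to reference_in_world."""
    return matmul4(invert_se3(reference_in_world), pose_in_world)


# Made-up poses for one recording: camera and hand in that recording's world frame.
camera_a = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 1.5],
            [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
hand_a = [[1.0, 0.0, 0.0, 0.2], [0.0, 1.0, 0.0, 1.2],
          [0.0, 0.0, 1.0, -0.4], [0.0, 0.0, 0.0, 1.0]]

# The same physical arrangement seen from a different world origin
# (rotated 90 degrees about z and shifted), as in a second recording.
world_shift = [[0.0, -1.0, 0.0, 5.0], [1.0, 0.0, 0.0, 1.0],
               [0.0, 0.0, 1.0, 2.0], [0.0, 0.0, 0.0, 1.0]]
camera_b = matmul4(world_shift, camera_a)
hand_b = matmul4(world_shift, hand_a)

# The camera-relative hand pose agrees across both recordings
# (up to floating-point rounding), even though the world poses differ.
rel_a = to_reference_frame(camera_a, hand_a)
rel_b = to_reference_frame(camera_b, hand_b)
```

Since the arbitrary world offset cancels out, the relative poses can be compared or pooled across recordings.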

Health Check

  • Last Commit: 9 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 1
  • Star History: 25 stars in the last 30 days
