TesserAct by UMass-Embodied-AGI

4D embodied world models for robotics

Created 8 months ago

369 stars

Top 76.5% on SourcePulse

Project Summary

TesserAct is an open-source, generalized 4D world model for robotics, designed to generate RGB, depth, and normal videos from image and text instructions, enabling 4D scene reconstruction and action prediction. It targets researchers and practitioners in embodied AI and robotics seeking to build more capable and generalizable robotic agents.

How It Works

TesserAct leverages a diffusion-based approach, building upon CogVideoX, to learn 4D representations of the world. It processes image and text inputs to predict future video frames, including geometric information like depth and normals, facilitating a comprehensive understanding of the environment for robotic control. This approach allows for the generation of realistic and geometrically consistent video predictions.

Quick Start & Requirements

Installation: Create and activate a conda environment with python=3.9, clone the repository, then install dependencies with pip install -r requirements.txt followed by pip install -e . from the repository root.
Prerequisites: Python 3.9 and a CUDA-capable GPU (not stated explicitly, but implied by the deep learning models).
Data Preparation: Scripts for data generation are provided in DATA.md.
Inference: Run python inference/inference_rgbdn_sft.py (RGB+Depth+Normal generation) or python inference/inference_rgb_lora.py (RGB-only LoRA generation), supplying the model weights and input image paths.
Point Cloud Rendering: Requires Blender 4.3+ and PyBlend.
Resources: LoRA fine-tuning requires ~30GB GPU memory.
Documentation: Refer to USAGE.MD for detailed inference guidance.
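
The prerequisites above can be checked before installing. The following is a minimal sketch, not part of the repository: the function name check_prereqs is hypothetical, and it assumes the Blender executable is called blender on PATH. It does not verify the Blender version (4.3+) or the presence of PyBlend.

```python
import shutil
import sys
from typing import List, Optional, Tuple

REQUIRED_PYTHON = (3, 9)  # taken from the Prerequisites line above


def check_prereqs(python_version: Tuple[int, int],
                  blender_path: Optional[str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper for illustration only, not a TesserAct API.
    """
    problems = []
    if tuple(python_version) != REQUIRED_PYTHON:
        problems.append(
            f"Python {python_version[0]}.{python_version[1]} found, 3.9 expected"
        )
    if blender_path is None:
        # Blender is only needed for point cloud rendering.
        problems.append("blender executable not found on PATH")
    return problems


if __name__ == "__main__":
    # Assumption: the Blender binary is named "blender".
    issues = check_prereqs(sys.version_info[:2], shutil.which("blender"))
    for issue in issues:
        print("WARNING:", issue)
    if not issues:
        print("Basic prerequisites look fine.")
```

Passing the version and path in as arguments keeps the check easy to run against values from another machine or environment.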

Highlighted Details

ICCV 2025 accepted paper.
Offers LoRA fine-tuning on custom datasets (~100 videos), with training taking about 2 days.
Provides RGB-only LoRA inference for enhanced generalization in robotics video generation.
Supports RGB+Depth+Normal generation and point cloud rendering from output videos.
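
To illustrate how a generated depth video can become a sequence of point clouds, here is a minimal sketch of standard pinhole back-projection. It is not the repository's rendering pipeline (which uses Blender and PyBlend). The function names, the camera intrinsics fx, fy, cx, cy, and the assumption that depth values are metric distances along the camera z-axis are all illustrative; the model's actual depth output may be scaled or normalized differently.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject_depth(depth: Sequence[Sequence[float]],
                      fx: float, fy: float,
                      cx: float, cy: float) -> List[Point]:
    """Back-project one depth map into 3D points in the camera frame.

    Assumes a pinhole camera and depth measured along the z-axis.
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):        # v: pixel row (image y)
        for u, z in enumerate(row):        # u: pixel column (image x)
            if z <= 0:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


def depth_video_to_point_clouds(frames: Sequence[Sequence[Sequence[float]]],
                                fx: float, fy: float,
                                cx: float, cy: float) -> List[List[Point]]:
    """Apply back-projection to every frame, giving one cloud per time step."""
    return [backproject_depth(frame, fx, fy, cx, cy) for frame in frames]


# Toy 2x2 depth map with one invalid pixel and made-up intrinsics.
toy_depth = [[1.0, 2.0],
             [0.0, 4.0]]
cloud = backproject_depth(toy_depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(cloud)  # [(-0.5, -0.5, 1.0), (1.0, -1.0, 2.0), (2.0, 2.0, 4.0)]
```

One point cloud per frame is what makes the result "4D": three spatial dimensions plus time. Colors from the RGB video and orientations from the normal video could be attached to each point by indexing the same pixel coordinates.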

Maintenance & Community

The project is associated with the UMass Embodied AGI Lab. Further community engagement details (e.g., Discord/Slack) are not explicitly mentioned in the README.

Licensing & Compatibility

The repository does not explicitly state a license; the code and models are described as released for research purposes. Whether commercial use or closed-source linking is permitted is not specified.

Limitations & Caveats

LoRA fine-tuning is experimental and not fully tested. Generated normal data may contain imperfections; work is ongoing to improve it using NormalCrafter. The full dataset is not yet released because of the storage size of the floating-point depth data.

Health Check

Last Commit: 5 months ago

Responsiveness: 1 day

Star History

7 stars in the last 30 days