UniTR by Haiyang-W

Research paper for multi-modal 3D perception using unified transformer

created 1 year ago
331 stars

Top 83.8% on sourcepulse

Project Summary

UniTR presents a unified, weight-sharing transformer backbone for multi-modal 3D perception, specifically targeting Bird's-Eye-View (BEV) representation. It aims to improve efficiency and performance in autonomous driving scenarios by jointly processing camera and LiDAR data within a single architecture, benefiting researchers and engineers in 3D computer vision and autonomous systems.

How It Works

UniTR employs a modality-agnostic transformer encoder that processes diverse sensor inputs (cameras, LiDAR) with shared parameters. This approach enables parallel, modal-wise representation learning and automatic cross-modal interaction without explicit fusion steps, offering a task-agnostic foundation for various 3D perception tasks.
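
A minimal PyTorch-style sketch of the weight-sharing idea follows; module names, depths, and token shapes are illustrative assumptions, not the repository's actual API.

```python
import torch
import torch.nn as nn

class SharedModalTransformer(nn.Module):
    """Toy sketch: a single transformer encoder whose weights are shared
    across camera and LiDAR tokens (illustrative only, not UniTR's modules)."""

    def __init__(self, dim: int = 256, depth: int = 4, heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, cam_tokens: torch.Tensor, lidar_tokens: torch.Tensor):
        # 1) Intra-modal stage: each modality is processed in parallel
        #    by the *same* encoder (shared parameters).
        cam = self.encoder(cam_tokens)
        lidar = self.encoder(lidar_tokens)
        # 2) Cross-modal stage: concatenating tokens lets attention mix
        #    modalities without an explicit fusion module.
        return self.encoder(torch.cat([cam, lidar], dim=1))

# Example usage with random "tokens" of shape (batch, num_tokens, dim)
model = SharedModalTransformer()
out = model(torch.randn(2, 100, 256), torch.randn(2, 200, 256))
print(out.shape)  # torch.Size([2, 300, 256])
```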

Quick Start & Requirements

  • Installation: Requires Python 3.8, PyTorch 1.10.1+cu113, torchvision 0.11.2+cu113, and nuscenes-devkit 1.0.5. Clone the repository and install dependencies via pip install -r requirements.txt.
  • Dataset: The nuScenes dataset (v1.0-trainval or v1.0-mini) is required; the repository details the expected directory layout and the data-info generation steps. Map expansion files are needed for BEV map segmentation.
  • Training: Pre-trained checkpoints are available. Training commands are provided for 3D object detection and BEV map segmentation, supporting multi-GPU setups.
  • Testing: Commands for testing both tasks are available, including options for cached backbone computations to accelerate inference; a rough evaluation sketch follows this list.
  • Links: nuScenes dataset, OpenPCDet, nuScenes map expansion
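
As a rough illustration of how a pretrained checkpoint is typically evaluated in an OpenPCDet-based codebase such as this one; the config path and checkpoint filename below are placeholders (check the repository's tools/cfgs directory and README for the real ones), and the OpenPCDet calls reflect that library's usual API rather than anything verified against UniTR specifically.

```python
from pcdet.config import cfg, cfg_from_yaml_file
from pcdet.datasets import build_dataloader
from pcdet.models import build_network
from pcdet.utils import common_utils

# Placeholder paths -- replace with the repository's actual UniTR config
# and a released checkpoint from the README.
CFG_FILE = 'tools/cfgs/nuscenes_models/unitr.yaml'
CKPT_FILE = 'unitr_checkpoint.pth'

cfg_from_yaml_file(CFG_FILE, cfg)
logger = common_utils.create_logger()

# Build the nuScenes validation loader and the network described by the config.
test_set, test_loader, _ = build_dataloader(
    dataset_cfg=cfg.DATA_CONFIG, class_names=cfg.CLASS_NAMES,
    batch_size=1, dist=False, workers=4, logger=logger, training=False)
model = build_network(model_cfg=cfg.MODEL, num_class=len(cfg.CLASS_NAMES),
                      dataset=test_set)
model.load_params_from_file(filename=CKPT_FILE, logger=logger, to_cpu=True)
model.cuda().eval()
```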

Highlighted Details

  • Achieves State-of-the-Art (SOTA) performance on nuScenes for 3D object detection (NDS 74.5) and BEV map segmentation.
  • Features a truly multi-modal fusion backbone, seamlessly connectable to any 3D detection head.
  • Offers a caching mechanism that can reduce backbone inference latency by up to 40% for datasets with consistent sensor parameters (see the sketch after this list).
  • Potential for integration with 2D vision foundation models like ViT due to architectural similarities.
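
The caching idea can be pictured as memoizing whatever depends only on fixed sensor calibration rather than on frame content. The following is a generic sketch of that pattern, not the repository's actual cache implementation; the function names and the dummy index computation are assumptions for illustration.

```python
from functools import lru_cache

import torch

@lru_cache(maxsize=1)
def cached_partition(calib_key: str) -> torch.Tensor:
    """Stand-in for the expensive token-partitioning step, which depends
    only on fixed camera/LiDAR calibration, not on the current frame."""
    # ... an expensive grouping / index computation would happen here ...
    return torch.arange(300)  # dummy indices

def backbone_forward(tokens: torch.Tensor, calib_key: str) -> torch.Tensor:
    # With identical sensor parameters across frames, the partition is
    # computed once and reused, which is where the latency saving comes from.
    idx = cached_partition(calib_key)
    return tokens[:, idx]

# Repeated calls with the same calibration key reuse the cached indices.
feats = backbone_forward(torch.randn(1, 300, 256), calib_key='cam_rig_v1')
```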

Maintenance & Community

The project is associated with authors from PKU and other institutions. Recent updates announce the release of GiT (ECCV 2024 oral) and note its potential integration with UniTR. Primary contact points are provided via email.

Licensing & Compatibility

The repository is released under the Apache 2.0 license, permitting commercial use and linking with closed-source projects.

Limitations & Caveats

The data partitioning step in the current implementation consumes roughly 40% of total runtime, leaving significant room for optimization. Cache mode only supports a batch size of 1, and FP16 training may encounter NaN gradients.
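
If FP16 instability is a concern, one common mitigation (generic PyTorch practice, not something prescribed by this repository) is automatic mixed precision with a gradient scaler, which skips optimizer steps whose gradients overflow. A minimal sketch, where `compute_loss` is a hypothetical placeholder for whatever returns the scalar training loss:

```python
import torch

scaler = torch.cuda.amp.GradScaler()

def train_step(model, batch, optimizer, compute_loss):
    """One AMP training step; `compute_loss` is a placeholder, not a real API."""
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():      # forward pass in mixed precision
        loss = compute_loss(model, batch)
    scaler.scale(loss).backward()        # scale the loss to avoid underflow
    scaler.step(optimizer)               # steps with inf/NaN grads are skipped
    scaler.update()                      # adjust the scale for the next step
```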

Health Check

  • Last commit: 11 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 14 stars in the last 90 days
