UniTR by Haiyang-W

Research paper for multi-modal 3D perception using unified transformer

created 1 year ago
331 stars

Top 83.8% on sourcepulse

Project Summary

UniTR presents a unified, weight-sharing transformer backbone for multi-modal 3D perception, specifically targeting Bird's-Eye-View (BEV) representation. It aims to improve efficiency and performance in autonomous driving scenarios by jointly processing camera and LiDAR data within a single architecture, benefiting researchers and engineers in 3D computer vision and autonomous systems.

How It Works

UniTR employs a modality-agnostic transformer encoder that processes diverse sensor inputs (cameras, LiDAR) with shared parameters. This approach enables parallel, modal-wise representation learning and automatic cross-modal interaction without explicit fusion steps, offering a task-agnostic foundation for various 3D perception tasks.
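
A minimal PyTorch-style sketch of the weight-sharing idea follows; module names, depths, and token shapes are illustrative assumptions, not the repository's actual API.

```python
import torch
import torch.nn as nn

class SharedModalTransformer(nn.Module):
    """Toy sketch: a single transformer encoder whose weights are shared
    across camera and LiDAR tokens (illustrative only, not UniTR's modules)."""

    def __init__(self, dim: int = 256, depth: int = 4, heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, cam_tokens: torch.Tensor, lidar_tokens: torch.Tensor):
        # 1) Intra-modal stage: each modality is processed in parallel
        #    by the *same* encoder (shared parameters).
        cam = self.encoder(cam_tokens)
        lidar = self.encoder(lidar_tokens)
        # 2) Cross-modal stage: concatenating tokens lets attention mix
        #    modalities without an explicit fusion module.
        return self.encoder(torch.cat([cam, lidar], dim=1))

# Example usage with random "tokens" of shape (batch, num_tokens, dim)
model = SharedModalTransformer()
out = model(torch.randn(2, 100, 256), torch.randn(2, 200, 256))
print(out.shape)  # torch.Size([2, 300, 256])
```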

Quick Start & Requirements

  • Installation: Requires Python 3.8, PyTorch 1.10.1+cu113, torchvision 0.11.2+cu113, and nuscenes-devkit 1.0.5. Clone the repository and install dependencies via pip install -r requirements.txt.
  • Dataset: The nuScenes dataset (v1.0-trainval or v1.0-mini) is required; the repository details the expected directory layout and the data-info generation steps. Map expansion files are needed for BEV map segmentation.
  • Training: Pre-trained checkpoints are available. Training commands are provided for 3D object detection and BEV map segmentation, supporting multi-GPU setups.
  • Testing: Commands for testing both tasks are available, including options for cached backbone computations to accelerate inference; a rough evaluation sketch follows this list.
  • Links: nuScenes dataset, OpenPCDet, nuScenes map expansion
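
As a rough illustration of how a pretrained checkpoint is typically evaluated in an OpenPCDet-based codebase such as this one; the config path and checkpoint filename below are placeholders (check the repository's tools/cfgs directory and README for the real ones), and the OpenPCDet calls reflect that library's usual API rather than anything verified against UniTR specifically.

```python
from pcdet.config import cfg, cfg_from_yaml_file
from pcdet.datasets import build_dataloader
from pcdet.models import build_network
from pcdet.utils import common_utils

# Placeholder paths -- replace with the repository's actual UniTR config
# and a released checkpoint from the README.
CFG_FILE = 'tools/cfgs/nuscenes_models/unitr.yaml'
CKPT_FILE = 'unitr_checkpoint.pth'

cfg_from_yaml_file(CFG_FILE, cfg)
logger = common_utils.create_logger()

# Build the nuScenes validation loader and the network described by the config.
test_set, test_loader, _ = build_dataloader(
    dataset_cfg=cfg.DATA_CONFIG, class_names=cfg.CLASS_NAMES,
    batch_size=1, dist=False, workers=4, logger=logger, training=False)
model = build_network(model_cfg=cfg.MODEL, num_class=len(cfg.CLASS_NAMES),
                      dataset=test_set)
model.load_params_from_file(filename=CKPT_FILE, logger=logger, to_cpu=True)
model.cuda().eval()
```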

Highlighted Details

  • Achieves State-of-the-Art (SOTA) performance on nuScenes for 3D object detection (NDS 74.5) and BEV map segmentation.
  • Features a truly multi-modal fusion backbone, seamlessly connectable to any 3D detection head.
  • Offers a caching mechanism that can reduce backbone inference latency by up to 40% for datasets with consistent sensor parameters (see the sketch after this list).
  • Potential for integration with 2D vision foundation models like ViT due to architectural similarities.
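
The caching idea can be pictured as memoizing whatever depends only on fixed sensor calibration rather than on frame content. The following is a generic sketch of that pattern, not the repository's actual cache implementation; the function names and the dummy index computation are assumptions for illustration.

```python
from functools import lru_cache

import torch

@lru_cache(maxsize=1)
def cached_partition(calib_key: str) -> torch.Tensor:
    """Stand-in for the expensive token-partitioning step, which depends
    only on fixed camera/LiDAR calibration, not on the current frame."""
    # ... an expensive grouping / index computation would happen here ...
    return torch.arange(300)  # dummy indices

def backbone_forward(tokens: torch.Tensor, calib_key: str) -> torch.Tensor:
    # With identical sensor parameters across frames, the partition is
    # computed once and reused, which is where the latency saving comes from.
    idx = cached_partition(calib_key)
    return tokens[:, idx]

# Repeated calls with the same calibration key reuse the cached indices.
feats = backbone_forward(torch.randn(1, 300, 256), calib_key='cam_rig_v1')
```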

Maintenance & Community

The project is associated with authors from PKU and other institutions. Recent updates announce the release of GiT (ECCV 2024 oral) and note its potential integration with UniTR. Primary contact points are provided via email.

Licensing & Compatibility

The repository is released under the Apache 2.0 license, permitting commercial use and linking with closed-source projects.

Limitations & Caveats

The data partitioning step in the current implementation consumes roughly 40% of total runtime, leaving significant room for optimization. Cache mode only supports a batch size of 1, and FP16 training may encounter NaN gradients.
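
If FP16 instability is a concern, one common mitigation (generic PyTorch practice, not something prescribed by this repository) is automatic mixed precision with a gradient scaler, which skips optimizer steps whose gradients overflow. A minimal sketch, where `compute_loss` is a hypothetical placeholder for whatever returns the scalar training loss:

```python
import torch

scaler = torch.cuda.amp.GradScaler()

def train_step(model, batch, optimizer, compute_loss):
    """One AMP training step; `compute_loss` is a placeholder, not a real API."""
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():      # forward pass in mixed precision
        loss = compute_loss(model, batch)
    scaler.scale(loss).backward()        # scale the loss to avoid underflow
    scaler.step(optimizer)               # steps with inf/NaN grads are skipped
    scaler.update()                      # adjust the scale for the next step
```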

Health Check

  • Last commit: 11 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 14 stars in the last 90 days
