UniTR by Haiyang-W

Research paper for multi-modal 3D perception using a unified transformer

Created 2 years ago · 337 stars · Top 81.6% on SourcePulse

Project Summary

UniTR presents a unified, weight-sharing transformer backbone for multi-modal 3D perception, specifically targeting Bird's-Eye-View (BEV) representation. It aims to improve efficiency and performance in autonomous driving scenarios by jointly processing camera and LiDAR data within a single architecture, benefiting researchers and engineers in 3D computer vision and autonomous systems.

How It Works

UniTR employs a modality-agnostic transformer encoder that processes diverse sensor inputs (cameras, LiDAR) with shared parameters. This approach enables parallel, modal-wise representation learning and automatic cross-modal interaction without explicit fusion steps, offering a task-agnostic foundation for various 3D perception tasks.
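To make the weight-sharing idea concrete, here is a minimal, hypothetical sketch (not the repository's actual architecture; SharedBackbone, the patch/point dimensions, and the single-stage fusion are all simplifications): camera patches and LiDAR points are embedded into a common token space, and one set of transformer blocks processes the concatenated tokens.

```python
import torch
from torch import nn

class SharedBackbone(nn.Module):
    """Toy weight-sharing encoder: one stack of transformer blocks
    serves both camera and LiDAR tokens (illustrative only)."""
    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.cam_embed = nn.Linear(3 * 16 * 16, dim)   # flattened 16x16 RGB patches
        self.lidar_embed = nn.Linear(4, dim)           # x, y, z, intensity per point

    def forward(self, cam_patches, lidar_points):
        tokens = torch.cat([self.cam_embed(cam_patches),
                            self.lidar_embed(lidar_points)], dim=1)
        # One forward pass: intra- and cross-modal attention happen inside
        # the same shared blocks, with no separate fusion module.
        return self.blocks(tokens)

backbone = SharedBackbone()
cam = torch.randn(2, 196, 3 * 16 * 16)   # batch, num_patches, patch_dim
pts = torch.randn(2, 1024, 4)            # batch, num_points, point_dim
fused = backbone(cam, pts)               # shape (2, 196 + 1024, 256)
```

The actual backbone organizes parallel modal-wise learning and cross-modal interaction through different token partitions, but the shared parameters are the property the sketch captures.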

Quick Start & Requirements

  • Installation: Requires Python 3.8, PyTorch 1.10.1+cu113, torchvision 0.11.2+cu113, and nuscenes-devkit 1.0.5. Clone the repository and install dependencies via pip install -r requirements.txt.
  • Dataset: NuScenes dataset (v1.0-trainval or v1.0-mini) is required, with specific directory organization and data-info generation steps outlined in the repository; a quick dataset sanity check is sketched after this list. Map expansion files are needed for BEV segmentation.
  • Training: Pre-trained checkpoints are available. Training commands are provided for 3D object detection and BEV map segmentation, supporting multi-GPU setups.
  • Testing: Commands for testing both tasks are available, including options for cached backbone computations to accelerate inference.
  • Links: NuScenes Dataset, OpenPCDet, Map Expansion
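Before running data-info generation, it can help to confirm the dataset is laid out where the devkit expects it. This is a generic nuscenes-devkit snippet, not a command from the UniTR repo, and the dataroot path is a placeholder:

```python
from nuscenes.nuscenes import NuScenes

# Placeholder path: point dataroot at your local NuScenes copy.
nusc = NuScenes(version='v1.0-mini', dataroot='data/nuscenes', verbose=True)

# Pull one sample and resolve its LiDAR sweep and front-camera frame,
# confirming both modalities are readable before generating data infos.
sample = nusc.sample[0]
print(nusc.get('sample_data', sample['data']['LIDAR_TOP'])['filename'])
print(nusc.get('sample_data', sample['data']['CAM_FRONT'])['filename'])
```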

Highlighted Details

  • Achieves State-of-the-Art (SOTA) performance on nuScenes for 3D object detection (NDS 74.5) and BEV map segmentation.
  • Features a truly multimodal fusion backbone, seamlessly connectable to any 3D detection head.
  • Offers a caching mechanism that can reduce backbone inference latency by up to 40% for datasets with consistent sensor parameters; a memoization sketch follows this list.
  • Potential for integration with 2D vision foundation models like ViT due to architectural similarities.
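The latency saving comes from the fact that, with fixed camera intrinsics/extrinsics, the expensive data-partitioning results are identical for every frame, so they can be computed once and reused. A hedged sketch of that memoization idea (cached_partition and its arguments are hypothetical, not the repository's API):

```python
from functools import lru_cache
import torch

@lru_cache(maxsize=1)
def cached_partition(num_tokens: int, window_size: int) -> torch.Tensor:
    """Compute window-partition indices once and reuse them on every frame.
    Valid only while the sensor layout (token count and order) stays fixed."""
    idx = torch.arange(num_tokens)
    pad = (-num_tokens) % window_size                  # pad so windows divide evenly
    idx = torch.cat([idx, idx.new_full((pad,), -1)])   # -1 marks padding slots
    return idx.view(-1, window_size)

windows = cached_partition(1220, 64)  # first call computes; later calls hit the cache
```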

Maintenance & Community

The project is associated with authors from PKU and other institutions. Recent updates include the release of GiT (ECCV 2024 oral) and note its potential for integration with UniTR. Primary contact points are provided via email.

Licensing & Compatibility

The repository is released under the Apache 2.0 license, permitting commercial use and linking with closed-source projects.

Limitations & Caveats

The current implementation's data-partitioning step consumes approximately 40% of total runtime, leaving significant room for optimization. Cache mode currently supports only a batch size of 1. FP16 training may encounter NaN gradients; a generic mitigation pattern is sketched below.
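One common mitigation for FP16 NaN gradients is dynamic loss scaling with PyTorch AMP, which skips optimizer steps whose gradients overflow. This is standard PyTorch, not a fix from the UniTR repo, and the toy model stands in for the real detector:

```python
import torch
from torch import nn

# Toy stand-in for the real detector: the AMP pattern is the point.
model = nn.Linear(16, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

for _ in range(100):
    x = torch.randn(8, 16, device='cuda')
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():    # forward pass in mixed precision
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()      # scale loss to avoid gradient underflow
    scaler.step(optimizer)             # step is skipped if grads are NaN/Inf
    scaler.update()                    # loss scale shrinks after an overflow
```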

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 4 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jeff Hammerbacher (Cofounder of Cloudera), and 10 more.

x-transformers by lucidrains

Transformer library with extensive experimental features
0.2% · 6k stars · Created 4 years ago · Updated 5 days ago
Starred by Aravind Srinivas (Cofounder of Perplexity), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 13 more.

pytorch3d by facebookresearch

PyTorch3D is a PyTorch library for 3D deep learning research
0.2% · 10k stars · Created 5 years ago · Updated 3 days ago