tapnet by google-deepmind

State-of-the-art point tracking for video and robotics

Created 3 years ago
1,719 stars

Top 24.7% on SourcePulse

Project Summary

Google DeepMind's TAPNet repository addresses the challenge of Tracking Any Point (TAP) in videos, offering state-of-the-art models like TAPIR and TAPNext. It provides datasets and benchmarks (TAP-Vid, TAPVid-3D, RoboTAP) crucial for advancing computer vision and robotics research, enabling precise point tracking across diverse scenarios.

How It Works

TAPIR employs a two-stage approach: frame-wise matching followed by temporal refinement, achieving high speed and accuracy. TAPNext reformulates the problem as next token prediction for a simpler, efficient tracker. BootsTAP enhances performance through self-supervised learning on unlabeled data, improving robustness to transformations and query variations.
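The two-stage idea can be illustrated with a minimal NumPy sketch. This is not the tapnet implementation: the dot-product matching and the moving-average "refinement" below are simplified stand-ins for TAPIR's learned matching and iterative refinement networks, and the function names are illustrative only.

```python
import numpy as np

def framewise_match(query_feat, frame_feats):
    # Stage 1 (frame-wise matching): score every location in every
    # frame independently against the query feature and take the best.
    # query_feat: (C,); frame_feats: (T, H, W, C)
    T, H, W, C = frame_feats.shape
    sims = frame_feats.reshape(T, H * W, C) @ query_feat  # (T, H*W)
    flat_idx = sims.argmax(axis=1)
    ys, xs = np.unravel_index(flat_idx, (H, W))
    return np.stack([xs, ys], axis=-1).astype(float)  # (T, 2) as (x, y)

def temporal_refine(track, window=3):
    # Stage 2 (temporal refinement): smooth the per-frame estimates
    # over time — a moving average standing in for TAPIR's learned
    # iterative refinement.
    kernel = np.ones(window) / window
    pad = window // 2
    padded = np.pad(track, ((pad, pad), (0, 0)), mode="edge")
    return np.stack(
        [np.convolve(padded[:, d], kernel, mode="valid") for d in range(2)],
        axis=-1,
    )
```

The split matters for speed: stage 1 is embarrassingly parallel across frames, and only the cheap stage 2 reasons across time.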

Quick Start & Requirements

Installation is straightforward via "pip install .". Key dependencies include JAX, with CUDA/cuDNN recommended for GPU acceleration. Colab notebooks offer immediate online demos. For local execution, clone the repository, install requirements, download checkpoints, and run the provided live demo script. Performance benchmarks indicate ~17 FPS on a Quadro RTX 4000 at 480x480 resolution.
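A minimal local setup, following the steps described above (checkpoint download and demo script names vary by model, so consult the repository README for those):

```shell
# Clone the repository and install it plus its requirements.
git clone https://github.com/google-deepmind/tapnet.git
cd tapnet
pip install .
# Then download the checkpoint for your chosen model (TAPIR, TAPNext, ...)
# and run the corresponding live demo script per the README.
```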

Highlighted Details

  • Achieves state-of-the-art results on the TAP-Vid benchmark for class-agnostic, long-term point tracking.
  • Includes RoboTAP, extending TAPIR for robotics manipulation tasks through imitation learning.
  • Features TAPVid-3D, a benchmark for 3D point tracking in real-world videos.
  • BootsTAP methodology significantly improves tracking accuracy via self-supervised training.
  • TAPNext offers a streamlined, high-performance tracking architecture.

Maintenance & Community

Developed by Google DeepMind, the project is backed by multiple research publications, indicating active development and validation. No specific community channels (e.g., Discord, Slack) are listed in the README.

Licensing & Compatibility

Core software is licensed under Apache 2.0, permitting commercial use. TAPVid-3D has a separate license, while dataset annotations and videos (TAP-Vid, RoboTAP) are under CC-BY 4.0. Source videos (DAVIS, Kinetics) retain their original licenses. Users should verify compatibility for specific video sources.

Limitations & Caveats

The README does not detail explicit limitations, alpha status, or known bugs. Careful attention is required for coordinate system conventions: some interfaces use normalized raster coordinates while others use regular (pixel) raster coordinates, and point ordering varies between (x, y) and (t, y, x). Accuracy generally improves at higher input resolutions at the cost of greater computational demand, and throughput depends heavily on the available hardware.
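The coordinate-convention pitfall above can be made concrete with two small conversion helpers. These are hypothetical utilities, not part of the tapnet API; they assume normalized (x, y) coordinates lie in [0, 1] and that regular raster coordinates are in pixel units:

```python
import numpy as np

def normalized_xy_to_raster_tyx(points_xy, frame_idx, height, width):
    # points_xy: (N, 2) normalized (x, y) in [0, 1]
    # returns: (N, 3) regular raster (t, y, x) in pixel units
    xs = points_xy[:, 0] * width
    ys = points_xy[:, 1] * height
    ts = np.full(len(points_xy), frame_idx, dtype=float)
    return np.stack([ts, ys, xs], axis=-1)

def raster_tyx_to_normalized_xy(points_tyx, height, width):
    # Inverse: drop the frame index and rescale back to [0, 1].
    xs = points_tyx[:, 2] / width
    ys = points_tyx[:, 1] / height
    return np.stack([xs, ys], axis=-1)
```

Mixing up the (x, y) and (y, x) orderings, or pixel and normalized units, silently produces plausible-looking but wrong tracks, so it is worth asserting ranges at module boundaries.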

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 46 stars in the last 30 days

Explore Similar Projects

Starred by Shengjia Zhao (Chief Scientist at Meta Superintelligence Lab), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 14 more.

BIG-bench by google

  • Top 0.1% · 3k stars
  • Collaborative benchmark for probing and extrapolating LLM capabilities
  • Created 4 years ago · Updated 1 year ago
  • Starred by Aravind Srinivas (Cofounder of Perplexity), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 16 more.

text-to-text-transfer-transformer by google-research

  • Top 0.1% · 6k stars
  • Unified text-to-text transformer for NLP research
  • Created 6 years ago · Updated 6 months ago