tapnet by google-deepmind

State-of-the-art point tracking for video and robotics

Created 3 years ago
1,719 stars

Top 24.7% on SourcePulse

Project Summary

Google DeepMind's TAPNet repository addresses the challenge of Tracking Any Point (TAP) in videos, offering state-of-the-art models like TAPIR and TAPNext. It provides datasets and benchmarks (TAP-Vid, TAPVid-3D, RoboTAP) crucial for advancing computer vision and robotics research, enabling precise point tracking across diverse scenarios.

How It Works

TAPIR employs a two-stage approach: frame-wise matching followed by temporal refinement, achieving high speed and accuracy. TAPNext reformulates the problem as next token prediction for a simpler, efficient tracker. BootsTAP enhances performance through self-supervised learning on unlabeled data, improving robustness to transformations and query variations.
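The two-stage idea can be illustrated with a minimal NumPy sketch. This is not the tapnet implementation: the dot-product matching and the moving-average "refinement" below are simplified stand-ins for TAPIR's learned matching and iterative refinement networks, and the function names are illustrative only.

```python
import numpy as np

def framewise_match(query_feat, frame_feats):
    # Stage 1 (frame-wise matching): score every location in every
    # frame independently against the query feature and take the best.
    # query_feat: (C,); frame_feats: (T, H, W, C)
    T, H, W, C = frame_feats.shape
    sims = frame_feats.reshape(T, H * W, C) @ query_feat  # (T, H*W)
    flat_idx = sims.argmax(axis=1)
    ys, xs = np.unravel_index(flat_idx, (H, W))
    return np.stack([xs, ys], axis=-1).astype(float)  # (T, 2) as (x, y)

def temporal_refine(track, window=3):
    # Stage 2 (temporal refinement): smooth the per-frame estimates
    # over time — a moving average standing in for TAPIR's learned
    # iterative refinement.
    kernel = np.ones(window) / window
    pad = window // 2
    padded = np.pad(track, ((pad, pad), (0, 0)), mode="edge")
    return np.stack(
        [np.convolve(padded[:, d], kernel, mode="valid") for d in range(2)],
        axis=-1,
    )
```

The split matters for speed: stage 1 is embarrassingly parallel across frames, and only the cheap stage 2 reasons across time.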

Quick Start & Requirements

Installation is straightforward via "pip install .". Key dependencies include JAX, with CUDA/cuDNN recommended for GPU acceleration. Colab notebooks offer immediate online demos. For local execution, clone the repository, install requirements, download checkpoints, and run the provided live demo script. Performance benchmarks indicate ~17 FPS on a Quadro RTX 4000 at 480x480 resolution.
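A minimal local setup, following the steps described above (checkpoint download and demo script names vary by model, so consult the repository README for those):

```shell
# Clone the repository and install it plus its requirements.
git clone https://github.com/google-deepmind/tapnet.git
cd tapnet
pip install .
# Then download the checkpoint for your chosen model (TAPIR, TAPNext, ...)
# and run the corresponding live demo script per the README.
```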

Highlighted Details

  • Achieves state-of-the-art results on the TAP-Vid benchmark for class-agnostic, long-term point tracking.
  • Includes RoboTAP, extending TAPIR for robotics manipulation tasks through imitation learning.
  • Features TAPVid-3D, a benchmark for 3D point tracking in real-world videos.
  • BootsTAP methodology significantly improves tracking accuracy via self-supervised training.
  • TAPNext offers a streamlined, high-performance tracking architecture.

Maintenance & Community

Developed by Google DeepMind, the project is backed by multiple research publications, indicating active development and validation. No specific community channels (e.g., Discord, Slack) are listed in the README.

Licensing & Compatibility

Core software is licensed under Apache 2.0, permitting commercial use. TAPVid-3D has a separate license, while dataset annotations and videos (TAP-Vid, RoboTAP) are under CC-BY 4.0. Source videos (DAVIS, Kinetics) retain their original licenses. Users should verify compatibility for specific video sources.

Limitations & Caveats

The README does not detail explicit limitations, alpha status, or known bugs. Careful attention is required for coordinate system conventions: some interfaces use normalized raster coordinates while others use regular (pixel) raster coordinates, and point ordering varies between (x, y) and (t, y, x). Accuracy generally improves at higher input resolutions at the cost of greater computational demand, and throughput depends heavily on the available hardware.
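The coordinate-convention pitfall above can be made concrete with two small conversion helpers. These are hypothetical utilities, not part of the tapnet API; they assume normalized (x, y) coordinates lie in [0, 1] and that regular raster coordinates are in pixel units:

```python
import numpy as np

def normalized_xy_to_raster_tyx(points_xy, frame_idx, height, width):
    # points_xy: (N, 2) normalized (x, y) in [0, 1]
    # returns: (N, 3) regular raster (t, y, x) in pixel units
    xs = points_xy[:, 0] * width
    ys = points_xy[:, 1] * height
    ts = np.full(len(points_xy), frame_idx, dtype=float)
    return np.stack([ts, ys, xs], axis=-1)

def raster_tyx_to_normalized_xy(points_tyx, height, width):
    # Inverse: drop the frame index and rescale back to [0, 1].
    xs = points_tyx[:, 2] / width
    ys = points_tyx[:, 1] / height
    return np.stack([xs, ys], axis=-1)
```

Mixing up the (x, y) and (y, x) orderings, or pixel and normalized units, silently produces plausible-looking but wrong tracks, so it is worth asserting ranges at module boundaries.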

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 46 stars in the last 30 days

Explore Similar Projects

Starred by Shengjia Zhao (Chief Scientist at Meta Superintelligence Lab), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 14 more.

BIG-bench by google

  • Top 0.1% · 3k stars
  • Collaborative benchmark for probing and extrapolating LLM capabilities
  • Created 4 years ago · Updated 1 year ago
  • Starred by Aravind Srinivas (Cofounder of Perplexity), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 16 more.

text-to-text-transfer-transformer by google-research

  • Top 0.1% · 6k stars
  • Unified text-to-text transformer for NLP research
  • Created 6 years ago · Updated 6 months ago