sam3.cpp by PABannier

Fast, portable image and video segmentation in C++

Created 3 months ago

333 stars

Top 82.0% on SourcePulse

1 Expert Loves This Project

Georgi Gerganov

Author of llama.cpp, whisper.cpp

Project Summary

This project provides a highly portable C++ implementation of state-of-the-art image and video segmentation models, including Meta's SAM 2, SAM 2.1, SAM 3, and EdgeTAM. It targets engineers and researchers seeking efficient, dependency-light segmentation capabilities that run directly on CPU or Apple Metal GPUs, eliminating the need for Python, PyTorch, or CUDA drivers. The primary benefit is fast, on-device segmentation with significantly reduced resource requirements.

How It Works

The core of the project is a single C++ library (sam3.cpp and sam3.h) built upon the ggml tensor computation library. This architecture allows for efficient inference across various hardware, including CPU and Apple Metal GPUs. It supports multiple model families, including SAM 2/2.1 (Hiera backbone), SAM 3 (ViT backbone with text detection), and EdgeTAM (RepViT backbone optimized for mobile). The library emphasizes performance through aggressive 4-bit quantization, drastically reducing model sizes and memory footprints while maintaining high accuracy.
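To make the size reduction from 4-bit quantization concrete, here is a minimal sketch of block-wise symmetric quantization. It is an illustration only and not the code used by sam3.cpp or ggml: the names Q4Block, quantize_block, and dequantize_block and the block size of 32 are assumptions, and ggml's real 4-bit formats differ in detail (for example, they store the scale in half precision).

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical layout for illustration: 32 weights share one scale.
constexpr std::size_t kBlockSize = 32;

struct Q4Block {
    float scale;                                  // per-block scale factor
    std::array<std::uint8_t, kBlockSize / 2> qs;  // two 4-bit codes per byte
};

// Map one float to an unsigned 4-bit code (signed value in [-7, 7], stored with a +8 offset).
inline std::uint8_t encode4(float v, float inv_scale) {
    int q = static_cast<int>(std::lround(v * inv_scale));
    q = std::max(-7, std::min(7, q));
    return static_cast<std::uint8_t>(q + 8);
}

// Quantize kBlockSize floats starting at x into one block.
inline Q4Block quantize_block(const float* x) {
    float amax = 0.0f;
    for (std::size_t i = 0; i < kBlockSize; ++i) {
        amax = std::max(amax, std::fabs(x[i]));
    }
    Q4Block b{};
    b.scale = amax / 7.0f;
    const float inv_scale = b.scale > 0.0f ? 1.0f / b.scale : 0.0f;
    for (std::size_t i = 0; i < kBlockSize / 2; ++i) {
        const std::uint8_t lo = encode4(x[2 * i], inv_scale);
        const std::uint8_t hi = encode4(x[2 * i + 1], inv_scale);
        b.qs[i] = static_cast<std::uint8_t>(lo | (hi << 4));  // pack two codes into one byte
    }
    return b;
}

// Reconstruct kBlockSize approximate floats from a block.
inline void dequantize_block(const Q4Block& b, float* out) {
    for (std::size_t i = 0; i < kBlockSize / 2; ++i) {
        out[2 * i]     = static_cast<float>((b.qs[i] & 0x0F) - 8) * b.scale;
        out[2 * i + 1] = static_cast<float>((b.qs[i] >> 4) - 8) * b.scale;
    }
}
```

In this sketch each block stores 16 bytes of codes plus a 4-byte scale for 32 weights, compared with 128 bytes for the same weights in 32-bit floats. Because values are rounded to the nearest code, the reconstruction error per weight is bounded by about half the block's scale.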

Quick Start & Requirements

Primary install/run command: Clone the repository recursively (git clone --recursive), then build using CMake (mkdir build && cd build && cmake .. && make -j).
Non-default prerequisites: C++14 compiler, CMake 3.14+. SDL2 and ImGui are optional for interactive GUI examples. Apple Metal GPU acceleration is automatic on macOS.
Links:
- Repository: https://github.com/PABannier/sam3.cpp
- Model Zoo (GGML format): https://huggingface.co/PABannier/sam3.cpp

Highlighted Details

Model Sizes: Achieves significant compression, with EdgeTAM models as small as 15 MB (4-bit quantized) and SAM 3 down to 673 MB (4-bit quantized).
Performance: EdgeTAM is claimed to be 22x faster than SAM 2 on mobile. Benchmarks on Apple M4 Pro show Metal GPU inference times as low as 0.4s/frame for EdgeTAM and 7.7s/frame for SAM 3 (including text detection and tracking).
Features: Supports text-prompted detection (SAM 3 only), point/box segmentation, and video tracking with a memory bank across all models.
Portability: Designed as a C++ library with no dependencies other than ggml and stb, making it highly portable.
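The memory bank used for video tracking can be pictured as a bounded first-in, first-out store of per-frame features, which is how the original SAM 2 design keeps memories of recent frames. The sketch below shows that idea only; the FrameMemory and MemoryBank names, the fields, and the capacity are hypothetical, and this summary does not describe how sam3.cpp actually implements it.

```cpp
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical per-frame record: the frame index plus an encoded feature vector.
struct FrameMemory {
    int frame_index;
    std::vector<float> features;
};

// Bounded FIFO: once full, adding a new frame evicts the oldest one.
class MemoryBank {
public:
    explicit MemoryBank(std::size_t capacity) : capacity_(capacity) {}

    void push(FrameMemory m) {
        if (capacity_ == 0) {
            return;  // a zero-capacity bank stores nothing
        }
        if (entries_.size() == capacity_) {
            entries_.pop_front();  // drop the oldest frame
        }
        entries_.push_back(std::move(m));
    }

    // Entries ordered from oldest to newest.
    const std::deque<FrameMemory>& entries() const { return entries_; }

private:
    std::size_t capacity_;
    std::deque<FrameMemory> entries_;
};
```

A tracker built this way would push one entry after each processed frame and condition the next frame's mask prediction on whatever entries() currently holds, so memory use stays constant regardless of video length.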

Maintenance & Community

The README credits Meta AI Research for the original models and acknowledges the ggml library. It provides no community channels (such as Discord or Slack), active maintainer information, or roadmap details.

Licensing & Compatibility

License type: MIT.
Compatibility: The MIT license is permissive, allowing commercial use and integration into closed-source projects, provided the copyright and license notice is retained.

Limitations & Caveats

Apple Metal GPU acceleration is exclusive to macOS. Text-prompted detection functionality is limited to the SAM 3 model family. Interactive GUI examples require SDL2, which may not be installed by default on all systems. Users may need to convert official PyTorch checkpoints to the GGML format using provided scripts.

Health Check

Last Commit: 3 months ago

Responsiveness: Inactive

Star History

14 stars in the last 30 days