PyTorch implementation accompanying the ImageBind multimodal embeddings research paper
ImageBind unifies embeddings across six modalities (image, text, audio, depth, thermal, IMU), enabling novel cross-modal applications like retrieval and generation. It's designed for researchers and developers working with multimodal AI who need a unified representation for diverse data types.
How It Works
ImageBind learns a joint embedding space by leveraging a modality-agnostic approach. It uses a contrastive learning objective to align representations from different modalities without requiring explicit cross-modal supervision. This allows for emergent zero-shot capabilities across modalities, as demonstrated by its performance on various benchmarks.
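To make the contrastive objective concrete, here is a minimal NumPy sketch of a symmetric InfoNCE-style loss between two batches of paired embeddings. This is an illustration of the general technique, not ImageBind's actual training code; the temperature value and the toy inputs are arbitrary assumptions.

```python
import numpy as np

def info_nce_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE loss aligning two batches of embeddings.

    emb_a, emb_b: (batch, dim) arrays from two modalities (e.g. image
    and audio). Matching pairs share the same row index.
    """
    # L2-normalize so dot products are cosine similarities
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature      # (batch, batch) similarity matrix
    labels = np.arange(len(a))          # positives sit on the diagonal

    def xent(l):
        # Cross-entropy against the diagonal, with the usual max-shift
        # for numerical stability
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the a->b and b->a directions
    return 0.5 * (xent(logits) + xent(logits.T))

# Toy check: perfectly aligned pairs score a lower loss than shuffled pairs
paired = np.eye(4, 8)        # 4 fake paired samples, 8-dim, orthogonal
shuffled = paired[::-1]      # break the pairing
print(info_nce_loss(paired, paired) < info_nce_loss(paired, shuffled))  # → True
```

Minimizing this loss pulls matching cross-modal pairs together and pushes non-matching pairs apart, which is the mechanism behind the emergent zero-shot alignment described above.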
Quick Start & Requirements
After setting up a conda environment with Python 3.10 and PyTorch 1.13+:

pip install .
pip install soundfile
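Once the model has produced embeddings, cross-modal applications such as retrieval reduce to similarity search in the shared space. The sketch below illustrates that with hand-made toy vectors standing in for real model outputs (the embeddings and dimensionality here are hypothetical, not ImageBind outputs):

```python
import numpy as np

def retrieve(query_emb, candidate_embs):
    """Return candidate indices ranked by cosine similarity to the query.

    In practice query_emb might come from text and candidate_embs from
    audio or images -- a unified embedding space means the same
    similarity measure works across modalities.
    """
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    scores = c @ q                 # cosine similarity of each candidate
    return np.argsort(-scores)     # indices, best match first

# Toy vectors standing in for real embeddings (hypothetical data)
text_emb = np.array([0.9, 0.1, 0.0])
audio_embs = np.array([
    [0.0, 1.0, 0.0],   # off-topic clip
    [1.0, 0.2, 0.0],   # clip matching the text
    [0.0, 0.0, 1.0],   # off-topic clip
])
print(retrieve(text_emb, audio_embs))  # → [1 0 2]
```

The same ranking function serves text-to-audio, audio-to-image, or any other pairing, since all modalities land in one embedding space.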
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The CC-BY-NC 4.0 license restricts the model to non-commercial use. The README's example code demonstrates only image, text, and audio; the remaining modalities (depth, thermal, IMU) are not covered by the provided snippet.