ImageBind by facebookresearch

PyTorch implementation of the ImageBind multimodal embeddings research paper

Created 2 years ago
8,935 stars

Top 5.7% on SourcePulse

Project Summary

ImageBind unifies embeddings across six modalities (image, text, audio, depth, thermal, IMU), enabling novel cross-modal applications like retrieval and generation. It's designed for researchers and developers working with multimodal AI who need a unified representation for diverse data types.

How It Works

ImageBind learns a single joint embedding space for all of its modalities. It trains with a contrastive objective that aligns each non-image modality to images using image-paired data, so it does not need datasets in which every pair of modalities occurs together. Because every modality is bound to the same image space, alignment between pairs never seen together in training (for example audio and text) emerges without direct supervision, which is what enables the zero-shot results reported on various benchmarks.
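The contrastive objective described above can be illustrated with a small InfoNCE-style loss. This is a minimal sketch in plain Python, not the project's training code: the function names are hypothetical, the temperature of 0.07 is an assumed value, and the toy vectors stand in for real encoder outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(anchors, others, temperature=0.07):
    """Mean InfoNCE loss; anchors[i] and others[i] form the positive pair.

    Every other item in `others` acts as a negative for anchors[i].
    The temperature is an assumed value chosen for illustration.
    """
    a = [l2_normalize(v) for v in anchors]
    b = [l2_normalize(v) for v in others]
    total = 0.0
    for i, a_i in enumerate(a):
        logits = [sum(x * y for x, y in zip(a_i, b_j)) / temperature for b_j in b]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


def symmetric_info_nce(images, others, temperature=0.07):
    """Sum of the image-to-modality and modality-to-image losses."""
    return info_nce(images, others, temperature) + info_nce(others, images, temperature)


# Toy embeddings (hypothetical): two images and two matching audio clips.
image_emb = [[1.0, 0.0], [0.0, 1.0]]
audio_aligned = [[1.0, 0.0], [0.0, 1.0]]
audio_shuffled = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = symmetric_info_nce(image_emb, audio_aligned)
loss_shuffled = symmetric_info_nce(image_emb, audio_shuffled)
```

Correctly paired embeddings yield a loss near zero, while mismatched pairs are penalized heavily; minimizing this loss is what pulls paired items together in the shared space.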

Quick Start & Requirements

  • Install from source: create a conda environment with Python 3.10, then run pip install . from the repository root.
  • Windows users may also need to run pip install soundfile to read and write audio files.
  • Requires PyTorch 1.13+; CUDA is needed only when running on a GPU.
  • See official quick-start for detailed examples.
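Once the package is installed, extracting embeddings might look like the sketch below. It assumes the module layout and function names used in the upstream README at the time of writing (`imagebind.data`, `imagebind_model.imagebind_huge`, `ModalityType`); check the official quick-start before relying on it. The wrapper function `embed_text_and_images` is a hypothetical helper, and the file paths passed to it are placeholders.

```python
def embed_text_and_images(text_list, image_paths):
    """Return (text_embeddings, image_embeddings) from a pretrained ImageBind model.

    Assumes the `imagebind` package and PyTorch are installed. Imports live
    inside the function so this module loads without the heavy dependencies.
    """
    import torch
    from imagebind import data
    from imagebind.models import imagebind_model
    from imagebind.models.imagebind_model import ModalityType

    # Use the GPU when CUDA is available, otherwise fall back to the CPU.
    device = "cuda:0" if torch.cuda.is_available() else "cpu"

    # Pretrained weights are fetched on first use.
    model = imagebind_model.imagebind_huge(pretrained=True)
    model.eval()
    model.to(device)

    # Each modality has its own loader that returns model-ready tensors.
    inputs = {
        ModalityType.TEXT: data.load_and_transform_text(text_list, device),
        ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    }

    with torch.no_grad():
        embeddings = model(inputs)

    return embeddings[ModalityType.TEXT], embeddings[ModalityType.VISION]
```

A call such as `embed_text_and_images(["A dog."], ["dog.jpg"])` would return one embedding per input; multiplying the image embeddings by the transposed text embeddings gives a similarity matrix between the two modalities.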

Highlighted Details

  • Shows emergent zero-shot classification on datasets such as ImageNet-1k (IN1k), Kinetics-400 (K400), NYU-D, ESC, LLVIP, and Ego4D.
  • Enables cross-modal retrieval, modality arithmetic, detection, and generation.
  • The README example covers image, text, and audio; the other three modalities (depth, thermal, IMU) are not shown there.
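Cross-modal retrieval and embedding arithmetic both reduce to similarity search in the shared space. The sketch below shows the idea with hand-made 3-dimensional vectors; the file names, vector values, and helper functions are all hypothetical stand-ins for real model outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def retrieve(query, gallery):
    """Return gallery keys ranked by similarity to the query, best match first."""
    return sorted(gallery, key=lambda name: cosine(query, gallery[name]), reverse=True)


def add_embeddings(u, v):
    """Element-wise sum, used here to combine two concepts into one query."""
    return [x + y for x, y in zip(u, v)]


# Toy image embeddings (hypothetical values, not real model outputs).
image_gallery = {
    "dog_on_beach.jpg": [0.9, 0.8, 0.0],
    "dog_indoors.jpg": [1.0, 0.0, 0.1],
    "empty_beach.jpg": [0.0, 1.0, 0.1],
}
audio_barking = [1.0, 0.1, 0.0]  # stands in for an audio clip of a dog barking
image_beach = [0.1, 1.0, 0.0]    # stands in for a photo of a beach

# Retrieval: an audio query ranks images directly.
audio_ranking = retrieve(audio_barking, image_gallery)

# Arithmetic: summing two embeddings yields a query that matches both concepts.
combined_ranking = retrieve(add_embeddings(audio_barking, image_beach), image_gallery)

print(audio_ranking[0])
print(combined_ranking[0])
```

Zero-shot classification follows the same pattern: embed one text prompt per class name, then pick the class whose text embedding ranks first for the input.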

Maintenance & Community

  • Developed by Meta AI (FAIR).
  • Paper accepted to CVPR 2023 (highlighted paper).
  • Links to paper, blog, demo, and supplementary video are provided.

Licensing & Compatibility

  • Released under the CC-BY-NC 4.0 license.
  • Non-commercial use only.

Limitations & Caveats

The CC-BY-NC 4.0 license prohibits commercial use. The code snippet in the README demonstrates only image, text, and audio; it does not show usage for depth, thermal, or IMU inputs.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

47 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

gill by kohjingyu

0%
470
Multimodal LLM for generating/retrieving images and generating text
Created 2 years ago
Updated 2 years ago