LanguageBind by PKU-YuanGroup

Multimodal pretraining framework using language-based semantic alignment

created 1 year ago
819 stars

Top 44.2% on sourcepulse

Project Summary

LanguageBind is a multimodal pretraining framework that extends video-language models to N-modalities using language as a unifying semantic bridge. It enables zero-shot cross-modal retrieval and classification across diverse data types like video, audio, depth, and thermal imagery, targeting researchers and developers in multimodal AI.

How It Works

LanguageBind employs a language-centric approach, aligning various modalities (video, audio, depth, thermal) to a shared semantic space defined by language. This is achieved by using modality-specific encoders that project data into a common embedding space, allowing for direct comparison and retrieval between different data types and text. The framework leverages a large-scale dataset (VIDAL-10M) and enhances language descriptions with ChatGPT for richer semantic alignment.
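
The alignment idea can be pictured with a short, self-contained sketch, a toy version in plain PyTorch rather than the repository's actual model code: each modality gets its own encoder, every encoder projects into one shared embedding dimension, and cross-modal retrieval reduces to cosine similarity against the language embeddings.

    # Conceptual sketch of language-centric alignment (illustrative only, not the
    # repository's actual model code). Each modality-specific encoder projects its
    # input into a shared D-dimensional space; language embeddings act as the anchor.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    D = 512  # shared embedding dimension (an assumption for illustration)

    class ToyEncoder(nn.Module):
        """Stand-in for a modality-specific encoder (video, audio, depth, thermal)."""
        def __init__(self, in_dim: int, out_dim: int = D):
            super().__init__()
            self.proj = nn.Linear(in_dim, out_dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Project into the shared space and L2-normalize so dot products
            # are cosine similarities.
            return F.normalize(self.proj(x), dim=-1)

    text_enc  = ToyEncoder(in_dim=768)   # stand-in for a CLIP-style text tower
    video_enc = ToyEncoder(in_dim=1024)  # stand-in for a video tower
    audio_enc = ToyEncoder(in_dim=128)   # stand-in for a spectrogram tower

    # Dummy features standing in for real encoder inputs.
    text_emb  = text_enc(torch.randn(4, 768))    # 4 captions
    video_emb = video_enc(torch.randn(4, 1024))  # 4 clips
    audio_emb = audio_enc(torch.randn(4, 128))   # 4 audio tracks

    # Because every modality lives in the same space, cross-modal retrieval is a
    # similarity matrix against the language embeddings.
    video_to_text = video_emb @ text_emb.T   # (4 clips) x (4 captions)
    audio_to_text = audio_emb @ text_emb.T
    print(video_to_text.softmax(dim=-1))
    print(audio_to_text.softmax(dim=-1))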

Quick Start & Requirements

  • Install via pip: pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116 followed by pip install -r requirements.txt.
  • Prerequisites: Python >= 3.8, PyTorch >= 1.13.1, CUDA >= 11.6.
  • Local demo: python gradio_app.py (a hedged programmatic-inference sketch follows this list).
  • Online demo: Hugging Face Spaces.
  • Official Docs: DATASETS.md, TRAIN_AND_VALIDATE.md.
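
For programmatic use after installation, inference with the released checkpoints roughly follows the pattern below. This is a hedged sketch based on the repository's documented usage; the imports, the clip_type mapping, and the checkpoint identifiers are assumptions to verify against the project README and TRAIN_AND_VALIDATE.md.

    # Hedged sketch of zero-shot video/audio-to-text scoring with the released
    # checkpoints. Class names, helpers, and checkpoint identifiers follow the
    # repository's documented usage but are assumptions here; verify against the README.
    import torch
    from languagebind import LanguageBind, to_device, transform_dict, LanguageBindImageTokenizer

    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

    # Map each modality to a pretrained checkpoint (identifiers are assumptions).
    clip_type = {'video': 'LanguageBind_Video_FT', 'audio': 'LanguageBind_Audio_FT'}
    model = LanguageBind(clip_type=clip_type, cache_dir='./cache_dir').to(device).eval()

    tokenizer = LanguageBindImageTokenizer.from_pretrained(
        'LanguageBind/LanguageBind_Image', cache_dir='./cache_dir')  # checkpoint id is an assumption
    modality_transform = {m: transform_dict[m](model.modality_config[m]) for m in clip_type}

    language = ['A dog barking in a park.', 'Rain falling on a tin roof.']
    video = ['assets/video/0.mp4']   # illustrative local paths
    audio = ['assets/audio/0.wav']

    inputs = {
        'video': to_device(modality_transform['video'](video), device),
        'audio': to_device(modality_transform['audio'](audio), device),
        'language': to_device(tokenizer(language, max_length=77, padding='max_length',
                                        truncation=True, return_tensors='pt'), device),
    }

    with torch.no_grad():
        embeddings = model(inputs)

    # Rows are video/audio samples, columns are captions; higher means better match.
    print((embeddings['video'] @ embeddings['language'].T).softmax(dim=-1))
    print((embeddings['audio'] @ embeddings['language'].T).softmax(dim=-1))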

Highlighted Details

  • Achieves state-of-the-art (SOTA) performance on multiple zero-shot cross-modal tasks, including video-language, audio-language, depth-language, and thermal-language.
  • Introduces VIDAL-10M, a 10-million-sample dataset encompassing video, infrared, depth, audio, and language modalities.
  • Supports flexible extension to unlimited modalities by adding new encoders.
  • Offers pre-trained models for various modalities on Hugging Face Hub.

Maintenance & Community

  • Accepted at ICLR 2024.
  • Development has included updates to the VIDAL dataset and related model releases (e.g., MoE-LLaVA).
  • Links to related projects like Video-LLaVA and Video-Bench are provided.

Licensing & Compatibility

  • Code License: MIT.
  • Dataset License: CC-BY-NC 4.0 (Non-commercial use).

Limitations & Caveats

The dataset is licensed for non-commercial use only, which may restrict commercial applications. The image encoder is initialized from OpenCLIP and not fine-tuned to the same extent as other modalities.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 20 stars in the last 90 days
