LanguageBind by PKU-YuanGroup

Multimodal pretraining framework using language-based semantic alignment

created 1 year ago
819 stars

Top 44.2% on sourcepulse

Project Summary

LanguageBind is a multimodal pretraining framework that extends video-language models to N-modalities using language as a unifying semantic bridge. It enables zero-shot cross-modal retrieval and classification across diverse data types like video, audio, depth, and thermal imagery, targeting researchers and developers in multimodal AI.

How It Works

LanguageBind employs a language-centric approach, aligning various modalities (video, audio, depth, thermal) to a shared semantic space defined by language. This is achieved by using modality-specific encoders that project data into a common embedding space, allowing for direct comparison and retrieval between different data types and text. The framework leverages a large-scale dataset (VIDAL-10M) and enhances language descriptions with ChatGPT for richer semantic alignment.
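
The alignment idea can be pictured with a short, self-contained sketch, a toy version in plain PyTorch rather than the repository's actual model code: each modality gets its own encoder, every encoder projects into one shared embedding dimension, and cross-modal retrieval reduces to cosine similarity against the language embeddings.

    # Conceptual sketch of language-centric alignment (illustrative only, not the
    # repository's actual model code). Each modality-specific encoder projects its
    # input into a shared D-dimensional space; language embeddings act as the anchor.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    D = 512  # shared embedding dimension (an assumption for illustration)

    class ToyEncoder(nn.Module):
        """Stand-in for a modality-specific encoder (video, audio, depth, thermal)."""
        def __init__(self, in_dim: int, out_dim: int = D):
            super().__init__()
            self.proj = nn.Linear(in_dim, out_dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Project into the shared space and L2-normalize so dot products
            # are cosine similarities.
            return F.normalize(self.proj(x), dim=-1)

    text_enc  = ToyEncoder(in_dim=768)   # stand-in for a CLIP-style text tower
    video_enc = ToyEncoder(in_dim=1024)  # stand-in for a video tower
    audio_enc = ToyEncoder(in_dim=128)   # stand-in for a spectrogram tower

    # Dummy features standing in for real encoder inputs.
    text_emb  = text_enc(torch.randn(4, 768))    # 4 captions
    video_emb = video_enc(torch.randn(4, 1024))  # 4 clips
    audio_emb = audio_enc(torch.randn(4, 128))   # 4 audio tracks

    # Because every modality lives in the same space, cross-modal retrieval is a
    # similarity matrix against the language embeddings.
    video_to_text = video_emb @ text_emb.T   # (4 clips) x (4 captions)
    audio_to_text = audio_emb @ text_emb.T
    print(video_to_text.softmax(dim=-1))
    print(audio_to_text.softmax(dim=-1))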

Quick Start & Requirements

  • Install via pip: pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116 followed by pip install -r requirements.txt.
  • Prerequisites: Python >= 3.8, PyTorch >= 1.13.1, CUDA >= 11.6.
  • Local demo: python gradio_app.py (a hedged programmatic-inference sketch follows this list).
  • Online demo: Hugging Face Spaces.
  • Official Docs: DATASETS.md, TRAIN_AND_VALIDATE.md.
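
For programmatic use after installation, inference with the released checkpoints roughly follows the pattern below. This is a hedged sketch based on the repository's documented usage; the imports, the clip_type mapping, and the checkpoint identifiers are assumptions to verify against the project README and TRAIN_AND_VALIDATE.md.

    # Hedged sketch of zero-shot video/audio-to-text scoring with the released
    # checkpoints. Class names, helpers, and checkpoint identifiers follow the
    # repository's documented usage but are assumptions here; verify against the README.
    import torch
    from languagebind import LanguageBind, to_device, transform_dict, LanguageBindImageTokenizer

    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

    # Map each modality to a pretrained checkpoint (identifiers are assumptions).
    clip_type = {'video': 'LanguageBind_Video_FT', 'audio': 'LanguageBind_Audio_FT'}
    model = LanguageBind(clip_type=clip_type, cache_dir='./cache_dir').to(device).eval()

    tokenizer = LanguageBindImageTokenizer.from_pretrained(
        'LanguageBind/LanguageBind_Image', cache_dir='./cache_dir')  # checkpoint id is an assumption
    modality_transform = {m: transform_dict[m](model.modality_config[m]) for m in clip_type}

    language = ['A dog barking in a park.', 'Rain falling on a tin roof.']
    video = ['assets/video/0.mp4']   # illustrative local paths
    audio = ['assets/audio/0.wav']

    inputs = {
        'video': to_device(modality_transform['video'](video), device),
        'audio': to_device(modality_transform['audio'](audio), device),
        'language': to_device(tokenizer(language, max_length=77, padding='max_length',
                                        truncation=True, return_tensors='pt'), device),
    }

    with torch.no_grad():
        embeddings = model(inputs)

    # Rows are video/audio samples, columns are captions; higher means better match.
    print((embeddings['video'] @ embeddings['language'].T).softmax(dim=-1))
    print((embeddings['audio'] @ embeddings['language'].T).softmax(dim=-1))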

Highlighted Details

  • Achieves state-of-the-art (SOTA) performance on multiple zero-shot cross-modal tasks, including video-language, audio-language, depth-language, and thermal-language.
  • Introduces VIDAL-10M, a 10-million-sample dataset encompassing video, infrared, depth, audio, and language modalities.
  • Supports flexible extension to unlimited modalities by adding new encoders.
  • Offers pre-trained models for various modalities on Hugging Face Hub.

Maintenance & Community

  • Accepted at ICLR 2024.
  • Development has included updates to the VIDAL dataset and related model releases (e.g., MoE-LLaVA).
  • Links to related projects like Video-LLaVA and Video-Bench are provided.

Licensing & Compatibility

  • Code License: MIT.
  • Dataset License: CC-BY-NC 4.0 (Non-commercial use).

Limitations & Caveats

The dataset is licensed for non-commercial use only, which may restrict commercial applications. The image encoder is initialized from OpenCLIP and not fine-tuned to the same extent as other modalities.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 20 stars in the last 90 days
