MILS by facebookresearch

Research paper implementation for multimodal LLM understanding

created 6 months ago
447 stars

Top 68.2% on sourcepulse

View on GitHub
Project Summary

This repository provides the official implementation for "LLMs can see and hear without any training," enabling large language models to perform multimodal tasks like image, audio, and video captioning, as well as image generation and style transfer, without task-specific training. It targets researchers and practitioners in multimodal AI.

How It Works

The MILS approach leverages pre-trained multimodal models (such as ViClip) to extract embeddings from images, audio, and video. A language model turns those embeddings into textual descriptions, and the descriptions in turn serve as prompts for a generative LLM to produce captions or new images, enabling "zero-shot" multimodal understanding and generation.
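As a rough illustration of that pipeline, here is a minimal Python sketch. The helper callables (embed, describe, llm_generate) are hypothetical stand-ins, not functions from the MILS codebase; in practice they would be backed by a pre-trained encoder such as ViClip and an instruction-following LLM.

```python
from typing import Callable, Dict, List

def caption_from_modalities(
    inputs: Dict[str, object],
    embed: Callable[[str, object], List[float]],      # hypothetical: (modality, data) -> embedding
    describe: Callable[[List[float]], str],           # hypothetical: embedding -> text description
    llm_generate: Callable[[str], str],               # hypothetical: prompt -> generated text
) -> str:
    """Turn each input modality into a text description, then prompt an LLM with them."""
    descriptions = [describe(embed(name, data)) for name, data in inputs.items()]
    prompt = (
        "The following lines describe the same scene from different modalities. "
        "Write a single caption:\n"
        + "\n".join(f"- {d}" for d in descriptions)
    )
    return llm_generate(prompt)

# Toy stand-ins so the sketch runs end to end; a real setup would plug in
# pre-trained encoders (e.g. ViClip for video) and an actual LLM.
if __name__ == "__main__":
    fake_embed = lambda name, data: [0.1, 0.9, 0.0]
    fake_describe = lambda emb: "a dog running along a beach"
    fake_llm = lambda prompt: "A dog sprints along the shoreline."
    print(caption_from_modalities({"image": b"raw-bytes"}, fake_embed, fake_describe, fake_llm))
```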

Quick Start & Requirements

  • Installation: Create and activate a conda environment using conda env create -f environment.yml and conda activate MILS.
  • Prerequisites: Requires the MS-COCO, Clotho, and MSR-VTT datasets, specific checkpoints (ViClip-InternVid-10M-FLT.pth), and Python. Inference runs on a single A100 GPU; the examples also show distributed inference across 8 A100 GPUs.
  • Setup: Involves downloading multiple large datasets and checkpoints, and updating paths in paths.py (an illustrative sketch follows this list).
  • Resources: Requires significant GPU memory and storage for datasets.
  • Documentation: Installation and usage instructions are detailed in the README.
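For reference, the paths.py edit might look something like the sketch below. The variable names are purely illustrative (the real file defines its own identifiers); only the checkpoint filename comes from the README.

```python
# paths.py -- illustrative sketch only; use the variable names the real file defines.
# Point each entry at wherever you downloaded the corresponding dataset or checkpoint.
MS_COCO_DIR = "/data/ms-coco"             # image captioning dataset
CLOTHO_DIR = "/data/clotho"               # audio captioning dataset
MSR_VTT_DIR = "/data/msr-vtt"             # video captioning dataset
VICLIP_CHECKPOINT = "/checkpoints/ViClip-InternVid-10M-FLT.pth"
OUTPUT_DIR = "/experiments/mils-outputs"  # where generated captions/images are written
```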

Highlighted Details

  • Enables LLMs to perform image, audio, and video captioning without task-specific fine-tuning.
  • Supports high-quality image generation and style transfer based on multimodal inputs.
  • Facilitates cross-modal arithmetic by converting modalities to text prompts for LLM-driven generation (see the toy sketch below).
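A toy illustration of that last point, with made-up descriptions rather than anything from the MILS codebase: cross-modal arithmetic here amounts to composing per-modality text descriptions into a single prompt for a downstream text-to-image generator.

```python
def compose_generation_prompt(content_desc: str, style_desc: str) -> str:
    """Combine a content description and a style description into one generation prompt."""
    return f"{content_desc}, rendered in the style of {style_desc}"

# e.g. content recovered from a photo, style recovered from a reference image
print(compose_generation_prompt("a dog running along a beach",
                                "an impressionist oil painting"))
```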

Maintenance & Community

  • The project is associated with Facebook Research.
  • Issues can be reported via the GitHub repository or by emailing kumar.ashutosh@utexas.edu.
  • Contribution guidelines are available in the CONTRIBUTING file.

Licensing & Compatibility

  • License: CC BY-NC 4.0.
  • Restrictions: Non-commercial use only. Third-party content is subject to its own licenses.

Limitations & Caveats

The code is primarily for inference and requires substantial setup: downloading large datasets and specific model checkpoints. The CC BY-NC 4.0 license restricts commercial applications.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 23 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations

Open-source framework for training large multimodal models

Top 0.1% on sourcepulse, 4k stars, created 2 years ago, updated 11 months ago