MILS by facebookresearch

Research paper implementation for multimodal LLM understanding

Created 8 months ago
450 stars

Top 66.9% on SourcePulse

View on GitHub
Project Summary

This repository provides the official implementation for "LLMs can see and hear without any training," enabling large language models to perform multimodal tasks like image, audio, and video captioning, as well as image generation and style transfer, without task-specific training. It targets researchers and practitioners in multimodal AI.

How It Works

The MILS approach treats each task as training-free, test-time optimization: a generative LLM proposes candidate text (captions or prompts), a pre-trained multimodal embedding model (such as ViClip) scores each candidate against the input image, audio, or video, and the scores are fed back to the LLM so it can refine its candidates over several iterations. For generation and style-transfer tasks, the optimized text prompt is then passed to a text-to-image model, effectively enabling "zero-shot" multimodal understanding and generation.
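
A minimal sketch of this generator-scorer loop, assuming a CLIP model as the scorer and a placeholder propose_captions call standing in for the LLM generator; the names, the choice of scorer, and the loop details are illustrative, not the repository's actual API:

```python
# Minimal sketch of a MILS-style test-time loop (not the repository's exact code).
# Assumption: CLIP is used as the scorer; propose_captions stands in for an LLM
# call that rewrites the current best candidates.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def score(image: Image.Image, captions: list[str]) -> torch.Tensor:
    """Similarity scores between the image and each candidate caption (higher is better)."""
    inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    return out.logits_per_image.squeeze(0)

def propose_captions(best: list[str], k: int = 8) -> list[str]:
    """Placeholder for the LLM generator: given the current top captions,
    ask an LLM (text-only prompt) to propose k refined variants."""
    raise NotImplementedError("plug in your LLM of choice here")

def mils_caption(image: Image.Image, init: list[str], steps: int = 10, keep: int = 5) -> str:
    """Iteratively refine a caption: scorer ranks the pool, LLM proposes new candidates."""
    pool = init
    for _ in range(steps):
        scores = score(image, pool)
        top = [pool[i] for i in scores.topk(min(keep, len(pool))).indices]
        pool = top + propose_captions(top)
    return pool[int(score(image, pool).argmax())]
```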

Quick Start & Requirements

  • Installation: Create and activate a conda environment using conda env create -f environment.yml and conda activate MILS.
  • Prerequisites: Requires Python (via the conda environment), the MS-COCO, Clotho, and MSR-VTT datasets, and checkpoints such as ViClip-InternVid-10M-FLT.pth. The code is designed for inference on a single A100 GPU, and the README examples show distributed inference across 8 A100s.
  • Setup: Involves downloading multiple large datasets and checkpoints, and updating the corresponding paths in paths.py (see the sketch after this list).
  • Resources: Requires significant GPU memory and storage for datasets.
  • Documentation: Installation and usage instructions are detailed in the README.
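
A hypothetical sketch of the kind of constants paths.py is expected to hold; the variable names and locations below are illustrative, not the repository's actual identifiers:

```python
# paths.py -- hypothetical sketch; the repository's actual variable names may differ.
# Point each entry at your local copy of the datasets and checkpoints.
MS_COCO_DIR = "/data/mscoco"            # image captioning dataset
CLOTHO_DIR = "/data/clotho"             # audio captioning dataset
MSRVTT_DIR = "/data/msrvtt"             # video captioning dataset
VICLIP_CHECKPOINT = "/checkpoints/ViClip-InternVid-10M-FLT.pth"
OUTPUT_DIR = "/results/mils"            # where generated captions/images are written
```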

Highlighted Details

  • Enables LLMs to perform image, audio, and video captioning without task-specific fine-tuning.
  • Supports high-quality image generation and style transfer based on multimodal inputs.
  • Facilitates cross-modal arithmetic by converting modalities to text prompts for LLM-driven generation (see the sketch below).
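
As a hypothetical illustration of that cross-modal arithmetic (all function names are placeholders, not the repository's API): each input is first inverted to a caption, and the captions are then combined into a single prompt for a text-to-image model.

```python
# Hypothetical illustration of cross-modal arithmetic via text; the function
# names are placeholders, not the repository's API.
def cross_modal_arithmetic(image, audio, caption_image, caption_audio, text_to_image):
    """Combine an image and an audio clip by merging their text descriptions."""
    image_caption = caption_image(image)   # e.g., "a red convertible on a coastal road"
    audio_caption = caption_audio(audio)   # e.g., "heavy rain and distant thunder"
    prompt = f"{image_caption}, {audio_caption}"  # textual 'sum' of the two modalities
    return text_to_image(prompt)            # generate an image reflecting both inputs
```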

Maintenance & Community

  • The project is associated with Facebook Research.
  • Issues can be reported via the GitHub repository or by emailing kumar.ashutosh@utexas.edu.
  • Contribution guidelines are available in the CONTRIBUTING file.

Licensing & Compatibility

  • License: CC BY-NC 4.0.
  • Restrictions: Non-commercial use only. Third-party content is subject to its own licenses.

Limitations & Caveats

The code is primarily for inference and requires substantial setup involving downloading large datasets and specific model checkpoints. The CC BY-NC 4.0 license restricts commercial applications.

Health Check

  • Last Commit: 4 months ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

gill by kohjingyu

0%
463 stars
Multimodal LLM for generating/retrieving images and generating text
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Elvis Saravia (founder of DAIR.AI).

NExT-GPT by NExT-GPT

0.1%
4k stars
Any-to-any multimodal LLM research paper
Created 2 years ago
Updated 4 months ago