Research paper implementation for multimodal LLM understanding
This repository provides the official implementation for "LLMs can see and hear without any training," enabling large language models to perform multimodal tasks like image, audio, and video captioning, as well as image generation and style transfer, without task-specific training. It targets researchers and practitioners in multimodal AI.
How It Works
MILS pairs an off-the-shelf LLM with pre-trained multimodal scoring models (such as CLIP, ViCLIP, and ImageBind). The LLM proposes candidate text descriptions, the scorer ranks them against the input image, audio clip, or video, and the highest-scoring candidates are fed back into the LLM's prompt for the next round. Iterating this generate-and-score loop steers the LLM toward descriptions that match the input; the resulting text is used directly as a caption or as a prompt for an image generator, enabling training-free multimodal understanding and generation.
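To make the loop concrete, here is a minimal, self-contained sketch of one generate-and-score iteration. It is not the repository's actual API: mils_loop, generate_candidates, and score_against_input are hypothetical names, and the toy generator and word-overlap scorer in the demo stand in for a real LLM and a real multimodal embedding model.

```python
from typing import Callable, Dict, List, Tuple

def mils_loop(
    generate_candidates: Callable[[List[str]], List[str]],  # LLM: current best texts -> new candidates
    score_against_input: Callable[[str], float],            # scorer: text -> similarity to the fixed input
    num_steps: int = 10,
    keep_top_k: int = 5,
) -> List[Tuple[float, str]]:
    """Iteratively refine text descriptions of a single image/audio/video input."""
    pool: Dict[str, float] = {}
    for _ in range(num_steps):
        seeds = sorted(pool, key=pool.get, reverse=True)[:keep_top_k]  # best texts so far
        for candidate in generate_candidates(seeds):                   # LLM proposes new descriptions
            if candidate not in pool:
                pool[candidate] = score_against_input(candidate)       # multimodal scorer rates them
    return sorted(((s, t) for t, s in pool.items()), reverse=True)[:keep_top_k]

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; a real run would query an LLM and
    # score candidates with a multimodal embedding model instead.
    target = "a dog catching a frisbee on a beach"
    fixed_candidates = ["a cat sleeping indoors", "a dog on a beach", "a dog catching a frisbee on a beach"]
    def word_overlap(text: str) -> float:
        return float(len(set(text.split()) & set(target.split())))
    for score, text in mils_loop(lambda seeds: fixed_candidates, word_overlap, num_steps=3):
        print(f"{score:.0f}  {text}")
```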
Quick Start & Requirements
Create and activate the conda environment:

conda env create -f environment.yml
conda activate MILS

Then download the required datasets and pre-trained model checkpoints, and set their locations in paths.py before running any of the scripts.
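The exact variables defined in paths.py depend on the tasks you run; the sketch below is purely illustrative of the kind of edits involved, and every name and path in it is a placeholder rather than the file's real contents.

```python
# Illustrative only -- the real paths.py in the repository defines its own variable
# names; every name and path below is a placeholder, not the file's actual contents.
MSCOCO_DIR = "/data/datasets/mscoco"       # image captioning dataset
CLOTHO_DIR = "/data/datasets/clotho"       # audio captioning dataset
MSRVTT_DIR = "/data/datasets/msrvtt"       # video captioning dataset
LLM_CHECKPOINT = "/data/checkpoints/llm"   # weights for the generator LLM
OUTPUT_DIR = "/data/mils/outputs"          # where generated captions/images are written
```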
Highlighted Details
Maintenance & Community
Contribution guidelines are provided in the CONTRIBUTING file.
Licensing & Compatibility
The code is released under the CC-BY-NC 4.0 license, which permits research and other non-commercial use.
Limitations & Caveats
The code is primarily for inference and requires substantial setup: large datasets and specific model checkpoints must be downloaded separately. The CC-BY-NC 4.0 license rules out commercial applications.