Multimodal research framework for aligning diverse modalities with language
OneLLM is a framework designed to align multiple modalities (images, video, audio, point clouds, etc.) with language, enabling unified multimodal understanding and generation. It targets researchers and practitioners in multimodal AI, offering a single architecture to handle diverse data types for tasks like multimodal chat and instruction following.
How It Works
OneLLM pairs a unified multimodal encoder with a language model backbone (based on Llama 2) to process and integrate information from multiple modalities. It employs a staged pre-training approach: image-text alignment first, then video, audio, and point cloud data, and finally specialized modalities such as depth maps, normal maps, IMU signals, and fMRI data. This progressive alignment lets the model learn increasingly complex cross-modal relationships.
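The core of this design is a shared projection that maps each modality's encoder output into the language model's embedding space, so any input type can be consumed as a sequence of pseudo-tokens. The sketch below illustrates that idea in PyTorch; the class name, dimensions, and modality-token scheme are illustrative assumptions, not OneLLM's actual implementation.

```python
# Illustrative sketch only: a shared projection maps features from any modality
# encoder into the LLM's token-embedding space. Names and sizes are assumptions.
import torch
import torch.nn as nn

class UnifiedProjection(nn.Module):
    """Maps per-modality features to language-model input embeddings."""

    def __init__(self, feat_dim: int, llm_dim: int, modalities: list):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        # A learnable token per modality signals which input type is being projected.
        self.modality_tokens = nn.ParameterDict(
            {m: nn.Parameter(torch.zeros(1, 1, feat_dim)) for m in modalities}
        )

    def forward(self, feats: torch.Tensor, modality: str) -> torch.Tensor:
        # feats: (batch, num_patches, feat_dim) from a frozen unified encoder.
        token = self.modality_tokens[modality].expand(feats.size(0), -1, -1)
        x = torch.cat([token, feats], dim=1)
        # Output is prepended to the text embeddings fed into the LLM.
        return self.proj(x)

# Example: project image features into a Llama-2-7B-sized (4096-d) embedding space.
proj = UnifiedProjection(feat_dim=1024, llm_dim=4096,
                         modalities=["image", "video", "audio", "point"])
image_feats = torch.randn(2, 256, 1024)
llm_inputs = proj(image_feats, "image")
print(llm_inputs.shape)  # torch.Size([2, 257, 4096])
```

In a staged setup like the one described above, the same projection would be reused as each new modality is added, so cross-modal alignment accumulates rather than being learned from scratch per modality.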
Quick Start & Requirements
Clone the repository and install the dependencies with pip install -r requirements.txt. Pretrained weights are hosted on Hugging Face under the model ID csuhan/OneLLM-7B.
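The published checkpoint can be fetched with the standard huggingface_hub client; the snippet below is a minimal download example using the model ID above (running inference then follows the repository's own demo scripts).

```python
# Minimal example: download the OneLLM-7B checkpoint from Hugging Face.
# Requires `pip install huggingface_hub`; the model ID comes from the quick start.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="csuhan/OneLLM-7B")
print(f"Checkpoint files downloaded to: {local_dir}")
```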
Highlighted Details
Maintenance & Community
Licensing & Compatibility
The project is based on Llama 2, which implies potential restrictions on commercial use under the Llama 2 license terms.
Limitations & Caveats
The README does not detail hardware requirements beyond GPU usage for demos and training, nor does it report performance benchmarks.