OneLLM by csuhan

Multimodal research paper aligning modalities with language

created 1 year ago
651 stars

Top 52.2% on sourcepulse

View on GitHub
Project Summary

OneLLM is a framework designed to align multiple modalities (images, video, audio, point clouds, etc.) with language, enabling unified multimodal understanding and generation. It targets researchers and practitioners in multimodal AI, offering a single architecture to handle diverse data types for tasks like multimodal chat and instruction following.

How It Works

OneLLM leverages a unified multimodal encoder and a language model backbone (based on Llama 2) to process and integrate information from various modalities. It employs a staged pre-training approach, starting with image-text alignment, then progressing to video, audio, and point cloud data, and finally incorporating specialized modalities such as depth maps, normal maps, IMU readings, and fMRI signals. This staged alignment allows the model to progressively learn complex cross-modal relationships.
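
The sketch below is a minimal, illustrative PyTorch rendering of this design, not OneLLM's actual code: the class names, dimensions, and the simple attention-pooling projector are assumptions made for clarity. Each modality gets a small projector that maps its features into the language model's token space, and a single shared backbone consumes the result.

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Pools modality-specific features (image patches, audio frames, ...) into a
    fixed number of tokens in the language model's embedding space."""
    def __init__(self, feat_dim: int, llm_dim: int, num_tokens: int = 30):
        super().__init__()
        self.proj = nn.Linear(feat_dim, llm_dim)
        self.query = nn.Parameter(torch.randn(num_tokens, llm_dim) * 0.02)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, seq_len, feat_dim) -> (batch, num_tokens, llm_dim)
        keys = self.proj(feats)
        attn = torch.softmax(self.query @ keys.transpose(1, 2), dim=-1)
        return attn @ keys

class UnifiedMultimodalLM(nn.Module):
    """One projector per modality, one shared language-model backbone."""
    def __init__(self, llm: nn.Module, llm_dim: int, feat_dims: dict):
        super().__init__()
        self.projectors = nn.ModuleDict(
            {name: ModalityProjector(dim, llm_dim) for name, dim in feat_dims.items()}
        )
        self.llm = llm  # e.g. a Llama-2-style decoder operating on input embeddings

    def forward(self, modality: str, feats, text_embeds):
        mm_tokens = self.projectors[modality](feats)
        # Prepend the modality tokens to the text embeddings and run the backbone once.
        return self.llm(torch.cat([mm_tokens, text_embeds], dim=1))
```

Under the staged recipe described above, a setup like this would first be aligned on image-text data, with further modalities folded in during later stages while the same backbone is reused.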

Quick Start & Requirements

  • Install via git clone and pip install -r requirements.txt.
  • Requires Python 3.9 and PyTorch.
  • Optional: Apex for mixed-precision training.
  • Pre-trained model weights are available on Hugging Face (csuhan/OneLLM-7B); see the download sketch after this list.
  • Demo and detailed data/evaluation/training instructions are provided.
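
A minimal sketch of fetching the released checkpoint, assuming the huggingface_hub package is installed; the repo id comes from the bullet above, and the download location defaults to the local Hugging Face cache.

```python
from huggingface_hub import snapshot_download

# Downloads the OneLLM-7B weights to the local Hugging Face cache and returns the path.
local_dir = snapshot_download(repo_id="csuhan/OneLLM-7B")
print(f"Checkpoint available at: {local_dir}")
```

The returned path points at the cached copy of the checkpoint, which the repository's demo and evaluation scripts can then be pointed at.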

Highlighted Details

  • Accepted to CVPR 2024.
  • Supports alignment of image, video, audio, point cloud, depth map, normal map, IMU, and fMRI data with language.
  • Offers both Gradio and CLI demos for local interaction.
  • Provides scripts for single-node and multi-node (DDP, SLURM) training.
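
As a hedged illustration of the multi-node setup mentioned in the last bullet, here is generic PyTorch DDP boilerplate rather than the repository's actual training script; the toy model, loss, and hyperparameters are placeholders.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun / SLURM launchers set RANK, WORLD_SIZE, and LOCAL_RANK for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(4096, 4096).cuda(local_rank)  # placeholder for the real model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    x = torch.randn(8, 4096, device=f"cuda:{local_rank}")
    loss = model(x).pow(2).mean()      # dummy loss
    loss.backward()                    # gradients are all-reduced across ranks
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with torchrun (e.g. --nproc_per_node set to the GPUs per node) or an equivalent SLURM step, each process drives one GPU and gradient synchronization happens inside DDP.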

Maintenance & Community

  • Project is associated with CVPR 2024.
  • Model weights and inference code released in December 2023.
  • No explicit community channels (Discord/Slack) are mentioned in the README.

Licensing & Compatibility

  • Built on Llama 2, so the released models are subject to the Llama 2 Community License.
  • Commercial use and closed-source linking are governed by that license's terms.

Limitations & Caveats

The project is based on Llama 2, implying potential restrictions on commercial use. The README does not detail specific hardware requirements beyond GPU usage for demos and training, nor does it specify performance benchmarks.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 13 stars in the last 90 days

Explore Similar Projects

Starred by Travis Fischer (Founder of Agentic), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 9 more.

LLaVA by haotian-liu

Multimodal assistant with GPT-4 level capabilities

created 2 years ago
updated 11 months ago
23k stars

Top 0.2% on sourcepulse