OneLLM by csuhan

Multimodal research paper aligning modalities with language

Created 1 year ago
653 stars

Top 51.1% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

OneLLM is a framework designed to align multiple modalities (images, video, audio, point clouds, etc.) with language, enabling unified multimodal understanding and generation. It targets researchers and practitioners in multimodal AI, offering a single architecture to handle diverse data types for tasks like multimodal chat and instruction following.

How It Works

OneLLM leverages a unified multimodal encoder and a language model backbone (based on Llama 2) to process and integrate information from various modalities. It employs a staged pre-training approach, starting with image-text alignment, then progressing to video, audio, and point cloud data, and finally incorporating specialized modalities like depth, normal, IMU, and fMRI. This staged alignment allows the model to progressively learn complex cross-modal relationships.
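
The sketch below illustrates the general idea in PyTorch: lightweight modality-specific adapters feed a shared encoder/projector that maps every modality into the language model's embedding space. All class names, dimensions, and the adapter design are illustrative assumptions, not the project's actual code.

```python
# Illustrative sketch of unified multimodal alignment (assumed design, not OneLLM's code).
import torch
import torch.nn as nn


class UnifiedProjector(nn.Module):
    """Shared encoder/projector applied to tokens from every modality (hypothetical)."""

    def __init__(self, feat_dim: int = 768, llm_dim: int = 4096, depth: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.to_llm = nn.Linear(feat_dim, llm_dim)  # project into the LLM embedding space

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.to_llm(self.encoder(tokens))


class MultimodalAligner(nn.Module):
    """Routes per-modality token sequences through one shared projector (hypothetical)."""

    def __init__(self, modality_dims: dict[str, int], feat_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        # Lightweight modality-specific input adapters, one per modality.
        self.adapters = nn.ModuleDict(
            {name: nn.Linear(dim, feat_dim) for name, dim in modality_dims.items()}
        )
        self.projector = UnifiedProjector(feat_dim, llm_dim)

    def forward(self, modality: str, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, num_tokens, modality_dim) -> (batch, num_tokens, llm_dim)
        return self.projector(self.adapters[modality](features))


# Image patches and audio frames end up in the same 4096-d LLM token space.
aligner = MultimodalAligner({"image": 1024, "audio": 512})
print(aligner("image", torch.randn(2, 256, 1024)).shape)  # torch.Size([2, 256, 4096])
print(aligner("audio", torch.randn(2, 128, 512)).shape)   # torch.Size([2, 128, 4096])
```

In the staged pre-training described above, such a shared projector would first be trained on image-text pairs and then reused as further modalities are added.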

Quick Start & Requirements

  • Install via git clone and pip install -r requirements.txt.
  • Requires Python 3.9 and PyTorch.
  • Optional: Apex for mixed-precision training.
  • Pre-trained model weights are available on Hugging Face (csuhan/OneLLM-7B); a download sketch follows this list.
  • Demo and detailed data/evaluation/training instructions are provided.
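
As a small illustration of fetching the released weights, the snippet below uses the huggingface_hub client; the repo id comes from the summary above, while the local directory is an arbitrary choice.

```python
# Download the released OneLLM-7B checkpoint from Hugging Face.
# The target directory "weights/OneLLM-7B" is a hypothetical choice.
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download(
    repo_id="csuhan/OneLLM-7B",
    local_dir="weights/OneLLM-7B",
)
print("Weights downloaded to:", ckpt_dir)
```

Refer to the project README for how the demos and training scripts expect the checkpoint to be laid out.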

Highlighted Details

  • Accepted to CVPR 2024.
  • Supports alignment of image, video, audio, point clouds, depth, normal, IMU, and fMRI data with language.
  • Offers both Gradio and CLI demos for local interaction.
  • Provides scripts for single-node and multi-node (DDP, SLURM) training; a generic DDP skeleton is sketched after this list.
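
The project's own launch scripts are authoritative; as a rough sketch of the multi-node DDP pattern they build on, here is a generic PyTorch skeleton (nothing here is OneLLM-specific).

```python
# Generic PyTorch DDP skeleton (illustrative; OneLLM ships its own launch scripts).
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun (or a SLURM wrapper) sets RANK, WORLD_SIZE, MASTER_ADDR, LOCAL_RANK, etc.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(512, 512).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])

    # ... build a DataLoader with DistributedSampler and run the training loop ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

On a single node this would be launched with, for example, torchrun --nproc_per_node=8 train_sketch.py; multi-node and SLURM launches add the usual rendezvous arguments.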

Maintenance & Community

  • Project is associated with CVPR 2024.
  • Model weights and inference code released in December 2023.
  • No explicit community channels (Discord/Slack) are mentioned in the README.

Licensing & Compatibility

  • Developed based on Llama 2; subject to the Llama 2 Community License.
  • Commercial use and closed-source linking are subject to the terms of the Llama 2 Community License.

Limitations & Caveats

The project is based on Llama 2, implying potential restrictions on commercial use. The README does not detail specific hardware requirements beyond GPU usage for demos and training, nor does it specify performance benchmarks.

Health Check

  • Last Commit: 11 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days

Explore Similar Projects

X-LLM by phellonchen
  • Starred by Chip Huyen (author of "AI Engineering", "Designing Machine Learning Systems").
  • 0% · 314 stars
  • Multimodal LLM research paper
  • Created 2 years ago · Updated 2 years ago

NExT-GPT by NExT-GPT
  • Starred by Chip Huyen (author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (founder of DAIR.AI).
  • 0.1% · 4k stars
  • Any-to-any multimodal LLM research paper
  • Created 2 years ago · Updated 4 months ago

DeepSeek-VL2 by deepseek-ai
  • Starred by Chip Huyen (author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (founder of DAIR.AI).
  • 0.1% · 5k stars
  • MoE vision-language model for multimodal understanding
  • Created 9 months ago · Updated 6 months ago