Multimodal LLM framework for comprehension and creation
DreamLLM is a framework for building and researching Multimodal Large Language Models (MLLMs), focusing on the synergy between multimodal comprehension and creation. It enables zero-shot generalist capabilities by modeling raw multimodal data and interleaved documents, targeting researchers and developers in the MLLM space.
How It Works
DreamLLM operates on two core principles: generative modeling of language and image posteriors directly in the raw multimodal space, and fostering the generation of raw, interleaved documents that include text, images, and unstructured layouts. The framework, named ♾️ Omni, treats MLLMs as LLMs augmented with plugin modules (e.g., vision encoders like CLIP, diffusion decoders like Stable Diffusion) connected via projectors. This modular design allows for flexible integration of various LLM bases, multimodal encoders, and projection methods.
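The sketch below illustrates this plugin-style composition in PyTorch. It is a minimal, illustrative sketch only: the class name, module stand-ins, dimensions, and wiring are assumptions for exposition, not DreamLLM's actual code, and the small linear/transformer stubs take the place of real pretrained CLIP, LLM, and Stable Diffusion components.

    # Illustrative sketch of the "LLM + plugin modules + projectors" design.
    # All names and sizes here are assumptions, not DreamLLM's real API.
    import torch
    import torch.nn as nn

    class MultimodalLM(nn.Module):
        def __init__(self, llm_dim=512, vision_dim=256, cond_dim=128):
            super().__init__()
            # Stand-ins for pretrained plugins: a CLIP-style vision encoder,
            # an LLM backbone, and a diffusion-decoder condition head.
            self.vision_encoder = nn.Linear(3 * 32 * 32, vision_dim)      # placeholder for CLIP
            self.llm = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
                num_layers=2,
            )                                                             # placeholder for the LLM base
            # Projectors connect the plugin modules to the LLM token space.
            self.input_projector = nn.Linear(vision_dim, llm_dim)         # vision features -> LLM embeddings
            self.output_projector = nn.Linear(llm_dim, cond_dim)          # LLM states -> diffusion condition

        def forward(self, text_embeds, image):
            # Encode the image and project it into the LLM embedding space.
            img_tokens = self.input_projector(self.vision_encoder(image.flatten(1)).unsqueeze(1))
            # Interleave image tokens with text embeddings and run the LLM.
            hidden = self.llm(torch.cat([img_tokens, text_embeds], dim=1))
            # Project the LLM hidden states into a condition for the image decoder.
            return self.output_projector(hidden)

    model = MultimodalLM()
    text = torch.randn(1, 16, 512)      # 16 dummy text-token embeddings
    image = torch.randn(1, 3, 32, 32)   # one dummy (downscaled) image
    print(model(text, image).shape)     # torch.Size([1, 17, 128])

The key design point is that the LLM base, the multimodal encoder, and the projection layers are swappable, which is what allows the framework to mix different backbones and decoders.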
Quick Start & Requirements
Installation is handled by the provided install.sh script. Example:
bash install.sh --env_name=omni --py_ver=3.10 --cuda=118
The install script also references a .torch_dir path.
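After installation, a quick sanity check of the resulting environment might look like the following generic PyTorch/CUDA check (this snippet is not shipped with the DreamLLM repository; it only verifies that the CUDA build requested via --cuda=118 is visible to PyTorch):

    # Generic post-install sanity check (not part of the DreamLLM repository).
    import torch

    print("torch version:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())
    print("CUDA build:", torch.version.cuda)
    if torch.cuda.is_available():
        print("device:", torch.cuda.get_device_name(0))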
Highlighted Details
Maintenance & Community
The project is associated with multiple authors from various institutions, indicating research-driven development, and the README links to their author pages. No community channels (Discord/Slack) or public roadmap are mentioned. Repository activity metadata shows the last update was about 8 months ago, and the project is currently marked inactive.
Licensing & Compatibility
The README does not state a license. Users should check the repository for a LICENSE file rather than assuming permissive terms; suitability for commercial use or closed-source linking is therefore unspecified.
Limitations & Caveats
The project is presented as a research framework; production readiness, test coverage, and long-term maintenance commitments are not documented. Users should anticipate breaking changes and the need for adaptation as it evolves.