DreamLLM by RunpeiDong

Multimodal LLM framework for comprehension and creation

created 1 year ago
452 stars

Top 67.7% on sourcepulse

View on GitHub
Project Summary

DreamLLM is a framework for building and researching Multimodal Large Language Models (MLLMs), focusing on the synergy between multimodal comprehension and creation. It enables zero-shot generalist capabilities by modeling raw multimodal data and interleaved documents, targeting researchers and developers in the MLLM space.

How It Works

DreamLLM operates on two core principles: generative modeling of both language and image posteriors directly in the raw multimodal space, and generation of raw, interleaved documents that mix text, images, and unstructured layouts. The underlying codebase, ♾️ Omni, treats an MLLM as an LLM augmented with plugin modules (e.g., vision encoders such as CLIP, diffusion decoders such as Stable Diffusion) connected via learned projectors. This modular design allows flexible mixing of LLM bases, multimodal encoders, and projection methods.
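To make the plugin idea concrete, here is a minimal, hypothetical sketch of that composition pattern. The class and method names are illustrative only, not DreamLLM's real API: an LLM base object holds named plugins, and each plugin carries a projector that maps its features into the LLM's embedding space.

```python
# Hypothetical sketch of a plugin-style MLLM (names are illustrative,
# not the actual Omni API): an LLM base augmented with swappable
# encoder/decoder modules, each attached via a projector.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Plugin:
    name: str  # e.g. "clip_vision" or "sd_decoder"
    # Maps plugin features into the LLM's embedding space.
    projector: Callable[[List[float]], List[float]]


@dataclass
class MultimodalLLM:
    llm_base: str  # e.g. a Vicuna or LLaMA checkpoint name
    plugins: Dict[str, Plugin] = field(default_factory=dict)

    def attach(self, plugin: Plugin) -> None:
        # Modularity: encoders and decoders are registered by name,
        # so the LLM base, encoder, and projector can each be swapped.
        self.plugins[plugin.name] = plugin

    def project(self, name: str, features: List[float]) -> List[float]:
        # Run one plugin's projector over its raw features.
        return self.plugins[name].projector(features)


# Usage: a toy scaling function stands in for a learned projector.
mllm = MultimodalLLM(llm_base="vicuna-7b")
mllm.attach(Plugin("clip_vision", projector=lambda v: [2.0 * x for x in v]))
print(mllm.project("clip_vision", [0.5, 1.0]))  # [1.0, 2.0]
```

In the real framework the projectors are learned modules and the features are tensors; the point here is only the composition: base model, named plugins, and per-plugin projection.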

Quick Start & Requirements

  • Installation: Use the provided install.sh script. Example: bash install.sh --env_name=omni --py_ver=3.10 --cuda=118.
  • Prerequisites: Python 3.10+, CUDA (version specified during install), PyTorch. The script handles PyTorch installation based on the CUDA version or a provided torch_dir.
  • Resources: Requires a suitable environment for MLLM development and training, including significant storage for datasets and model checkpoints.
  • Documentation: Installation guide, project structure, dataset builders, training scripts, and configuration details are available within the repository.
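The README says the install script selects a PyTorch build from the `--cuda` flag or uses a provided `torch_dir`. The following is a hedged sketch of that kind of dispatch, with assumed logic (it echoes the commands rather than running them; check `install.sh` itself for the actual behavior):

```shell
# Assumed dispatch logic, not the actual install.sh: choose a PyTorch
# wheel index from the CUDA version, unless a local torch_dir is given.
cuda="118"     # corresponds to --cuda=118 in the documented example
torch_dir=""   # corresponds to an optional --torch_dir

if [ -n "$torch_dir" ]; then
  # A local PyTorch build was supplied; install from that directory.
  echo "pip install ${torch_dir}/torch-*.whl"
else
  # Otherwise pull a CUDA-matched wheel from the PyTorch index.
  echo "pip install torch --index-url https://download.pytorch.org/whl/cu${cuda}"
fi
```

This only illustrates why the CUDA version must be passed at install time: it selects the wheel index the script installs from.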

Highlighted Details

  • ICLR 2024 Spotlight paper.
  • Supports a wide range of datasets including Laion, WebVid, and LLaVA.
  • Modular architecture allows customization of LLM base, vision encoders, and projectors.
  • Framework includes tools for dataset management, training script definition, and configuration.

Maintenance & Community

The project is a multi-institution research effort, with links to author pages provided in the README. No community channels (Discord/Slack) or roadmap are explicitly mentioned.

Licensing & Compatibility

The README does not explicitly state a license, so users should check the repository for a LICENSE file before relying on it. Compatibility with commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is presented as a research framework; production readiness, test coverage, and long-term maintenance commitments are not detailed. Users should anticipate breaking changes and the need for adaptation as it evolves.

Health Check
Last commit

8 months ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
15 stars in the last 90 days
