VoRA by Hon-Wong

Encoder-free multimodal LLM that gains visual capabilities through vision-specific LoRA layers

created 4 months ago
322 stars

Top 85.5% on sourcepulse

Project Summary

VoRA introduces a novel paradigm for integrating visual capabilities into Large Language Models (LLMs) by embedding vision-specific LoRA (Low-Rank Adaptation) layers directly within the LLM architecture. This encoder-free approach allows the visual parameters to be merged into the base weights for inference, eliminating the complexity and computational overhead of an external vision module. It targets researchers and developers aiming to build efficient multimodal LLMs (MLLMs) that can process images at arbitrary resolutions while leveraging pre-trained visual knowledge.

How It Works

VoRA internalizes visual processing by injecting LoRA layers directly into the LLM, avoiding the need for separate vision encoders. This design facilitates parameter merging for inference, reducing complexity and computational cost. A block-wise distillation method transfers visual priors from pre-trained Vision Transformers (ViTs) into the LoRA layers, accelerating training. Bi-directional attention masks are employed to enhance context capture from images.
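
The merge-friendly design can be pictured with a minimal PyTorch sketch: a linear layer carrying a vision LoRA adapter that folds back into the frozen base weight, plus a block-wise distillation loss aligning per-block hidden states with a frozen ViT. This is a generic illustration under assumed details (rank, scaling, MSE distance, learned projection), not VoRA's actual implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VisionLoRALinear(nn.Module):
        """Linear layer with a vision-specific low-rank adapter (generic LoRA).

        Training updates only lora_A / lora_B; merge() folds them into the
        frozen base weight, so the deployed layer is a plain nn.Linear with
        no extra parameters or latency.
        """

        def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 32.0):
            super().__init__()
            self.base = base
            self.base.weight.requires_grad_(False)
            self.scaling = alpha / r
            self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # y = W x + scaling * B (A x); B starts at zero, so the adapter
            # is a no-op at initialization.
            return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

        @torch.no_grad()
        def merge(self) -> nn.Linear:
            # W' = W + scaling * B A -- afterwards the adapter can be dropped.
            self.base.weight += self.scaling * (self.lora_B @ self.lora_A)
            return self.base

    def blockwise_distill_loss(llm_block_hidden, vit_block_hidden, proj):
        # Block-wise distillation sketch: match the LLM's image-token hidden
        # states, block by block, to a frozen ViT's block outputs through a
        # learned projection. MSE is an assumed choice of distance.
        return F.mse_loss(llm_block_hidden, proj(vit_block_hidden))

After merging, the deployed network is architecturally identical to the original LLM, which is what eliminates the encoder-side inference cost.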

Quick Start & Requirements

  • Install: pip3 install -e . after cloning the repository.
  • Prerequisites: git-lfs for dataset cloning.
  • Data: Requires downloading large datasets (e.g., VoRA-Recap-8M, VoRA-Recap-29M) from Hugging Face.
  • Training: Supports distributed training via DeepSpeed and Torchrun.
  • Evaluation: Can be evaluated using LMMs-Eval.
  • Links: Official Website, arXiv Paper, Hugging Face Collection.
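
For a sense of the inference-side workflow, a hypothetical loading sketch via Hugging Face transformers is shown below. The repository id and the trust_remote_code assumption are illustrative only; consult the Hugging Face collection for the real checkpoint names and the repo's scripts for the supported API.

    # Hypothetical loading sketch -- the model id below is illustrative only;
    # check the Hugging Face collection for the actual checkpoint names.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Hon-Wong/VoRA-7B-Instruct"  # assumed name, verify before use
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        trust_remote_code=True,  # assuming the checkpoint ships custom modeling code
        torch_dtype="auto",
    )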

Highlighted Details

  • Encoder-free MLLM architecture.
  • Arbitrary resolution image processing.
  • Block-wise distillation for visual prior injection.
  • Bi-directional attention masks for improved context.
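
The last point can be made concrete: within an otherwise causal LLM, positions holding image tokens are allowed to attend to one another in both directions, while text keeps causal attention. A minimal sketch, assuming a boolean vector marking image positions (an illustration of the idea, not VoRA's exact masking rule):

    import torch

    def mixed_attention_mask(is_image: torch.Tensor) -> torch.Tensor:
        """Build an (L, L) boolean mask where True means attention is allowed.

        Text tokens keep causal (lower-triangular) attention; image tokens
        additionally attend to every other image token in both directions.
        `is_image` is a boolean vector of length L marking image positions.
        """
        L = is_image.numel()
        causal = torch.tril(torch.ones(L, L, dtype=torch.bool))
        # Pairs where both query and key are image tokens are fully bidirectional.
        image_pairs = is_image.unsqueeze(1) & is_image.unsqueeze(0)
        return causal | image_pairs

    # Example: 4 image tokens followed by 3 text tokens.
    mask = mixed_attention_mask(torch.tensor([1, 1, 1, 1, 0, 0, 0], dtype=torch.bool))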

Maintenance & Community

  • Training code, weights, and data were released in April 2025.
  • LMMs-Eval supports VoRA.
  • No explicit community links (Discord/Slack) are provided in the README.

Licensing & Compatibility

  • The repository is described as "[Fully open]", but the specific license type is not explicitly stated in the README; verify the license file before any commercial use.

Limitations & Caveats

  • The README does not specify the base LLM used or provide explicit compatibility information for different LLM architectures.
  • The "Fully open" claim requires verification against the actual license file.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 2

Star History

  • 179 stars in the last 90 days
