VoRA introduces a novel paradigm for integrating visual capabilities into Large Language Models (LLMs) by embedding vision-specific LoRA (Low-Rank Adaptation) layers directly within the LLM architecture. Because the approach is encoder-free, the visual parameters can be merged into the base weights at inference time, removing the complexity and computational overhead of external vision modules. It targets researchers and developers aiming to build efficient multimodal LLMs (MLLMs) that process arbitrary image resolutions and leverage pre-trained visual knowledge.
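As a rough illustration of what "merging visual parameters" means here, the sketch below folds a low-rank vision adapter back into a frozen linear layer so inference runs on a plain nn.Linear. This is generic LoRA arithmetic, not VoRA's actual code; the class name VisionLoRALinear, the rank/alpha defaults, and the merge_ helper are illustrative assumptions.

```python
import torch
import torch.nn as nn


class VisionLoRALinear(nn.Module):
    """Hypothetical wrapper: a frozen LLM projection plus a low-rank vision adapter."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base                      # frozen LLM weight W
        self.scale = alpha / rank
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)  # A
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))         # B

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Training-time path: base output plus the scaled low-rank update.
        return self.base(x) + self.scale * ((x @ self.lora_a.T) @ self.lora_b.T)

    @torch.no_grad()
    def merge_(self) -> nn.Linear:
        # Fold the adapter into the base weight: W' = W + (alpha / r) * B A.
        self.base.weight.add_(self.scale * (self.lora_b @ self.lora_a))
        return self.base                      # a plain nn.Linear; no extra module at inference


layer = VisionLoRALinear(nn.Linear(4096, 4096))
merged = layer.merge_()                       # inference then uses the vanilla layer
```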
How It Works
VoRA internalizes visual processing by injecting LoRA layers directly into the LLM, removing the need for a separate vision encoder. This design allows the LoRA parameters to be merged into the base weights, reducing complexity and computational cost at inference. A block-wise distillation method transfers visual priors from pre-trained Vision Transformers (ViTs) into the LoRA layers, accelerating training, and bi-directional attention masks are applied to image tokens so the model captures full visual context rather than attending only causally.
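To make the distillation idea concrete, here is a hedged sketch rather than the paper's implementation: hidden states from selected LLM blocks on the image tokens are projected to the ViT feature width and regressed onto the corresponding frozen ViT block outputs. The function name, the per-block linear heads, the MSE objective, and all shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def blockwise_distill_loss(
    llm_hidden: list[torch.Tensor],   # per-block LLM hidden states on image tokens
    vit_hidden: list[torch.Tensor],   # per-block ViT features (frozen teacher)
    proj: nn.ModuleList,              # one linear head per distilled block pair (assumption)
) -> torch.Tensor:
    losses = []
    for h_llm, h_vit, head in zip(llm_hidden, vit_hidden, proj):
        pred = head(h_llm)                               # map LLM width -> ViT width
        losses.append(F.mse_loss(pred, h_vit.detach()))  # teacher features stay frozen
    return torch.stack(losses).mean()


# Toy shapes: 2 distilled block pairs, 196 image tokens, LLM dim 4096, ViT dim 1024.
proj = nn.ModuleList([nn.Linear(4096, 1024) for _ in range(2)])
llm_hidden = [torch.randn(1, 196, 4096) for _ in range(2)]
vit_hidden = [torch.randn(1, 196, 1024) for _ in range(2)]
loss = blockwise_distill_loss(llm_hidden, vit_hidden, proj)
```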
Quick Start & Requirements
Install with pip3 install -e . after cloning the repository. git-lfs is required for dataset cloning.
Highlighted Details
Maintenance & Community
Last update: 1 month ago; the project currently appears inactive.
Licensing & Compatibility
Limitations & Caveats