Vision transformer research paper
Top 74.1% on SourcePulse
VisionLLaMA presents a unified LLaMA-like transformer backbone for diverse vision tasks, spanning both perception and generation. It aims to provide a strong, generic baseline for vision research by adapting the LLaMA transformer design from large language models to image processing.
How It Works
VisionLLaMA adapts the transformer architecture underlying Large Language Models such as LLaMA to 2D image processing. It introduces both plain and pyramid variants of this LLaMA-like vision transformer, tailored for visual data. This unified approach lets a single backbone handle a wide range of vision tasks, and the authors report substantial gains over existing vision transformers on many of them.
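The sketch below illustrates, in PyTorch, what a LLaMA-style block adapted to 2D patch tokens can look like: RMSNorm, a SwiGLU feed-forward, and rotary position embeddings applied along the row and column axes. It is a minimal, hypothetical illustration of the general idea; the class and function names (`LlamaLikeVisionBlock`, `rope_2d`) and all hyperparameters are assumptions, not VisionLLaMA's actual implementation.

```python
# Minimal, hypothetical sketch of a LLaMA-style block on a 2D patch grid.
# Not VisionLLaMA's code; names and hyperparameters are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)


def rope_2d(q, k, h, w):
    """Apply rotary embeddings separately along the H and W axes.

    q, k: (batch, heads, h*w, head_dim) with head_dim divisible by 4.
    """
    d = q.shape[-1]
    d_half = d // 2  # first half of channels encodes rows, second half columns
    ys, xs = torch.meshgrid(
        torch.arange(h, device=q.device),
        torch.arange(w, device=q.device),
        indexing="ij",
    )
    pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (h*w, 2)

    def rotate(x, coords):
        # GPT-NeoX-style RoPE rotation using 1D coordinates for one axis.
        half = x.shape[-1] // 2
        freqs = 1.0 / (10000 ** (torch.arange(half, device=x.device).float() / half))
        angles = coords[:, None] * freqs[None, :]  # (n, half)
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[..., :half], x[..., half:]
        return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

    def apply(x):
        rows = rotate(x[..., :d_half], pos[:, 0])
        cols = rotate(x[..., d_half:], pos[:, 1])
        return torch.cat([rows, cols], dim=-1)

    return apply(q), apply(k)


class LlamaLikeVisionBlock(nn.Module):
    def __init__(self, dim=384, n_heads=6, mlp_ratio=4.0):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.norm1 = RMSNorm(dim)
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        self.norm2 = RMSNorm(dim)
        hidden = int(dim * mlp_ratio)
        # SwiGLU feed-forward, as in LLaMA-style blocks.
        self.w1 = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(dim, hidden, bias=False)
        self.w3 = nn.Linear(hidden, dim, bias=False)

    def forward(self, x, h, w):
        b, n, d = x.shape
        qkv = self.qkv(self.norm1(x)).view(b, n, 3, self.n_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each (b, heads, n, head_dim)
        q, k = rope_2d(q, k, h, w)
        attn = F.scaled_dot_product_attention(q, k, v)
        x = x + self.proj(attn.transpose(1, 2).reshape(b, n, d))
        y = self.norm2(x)
        x = x + self.w3(F.silu(self.w1(y)) * self.w2(y))
        return x


if __name__ == "__main__":
    block = LlamaLikeVisionBlock(dim=384, n_heads=6)
    tokens = torch.randn(2, 14 * 14, 384)      # a 14x14 patch grid
    print(block(tokens, h=14, w=14).shape)      # torch.Size([2, 196, 384])
```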
Quick Start & Requirements
Setup, requirements, and pretraining instructions are provided in PRETRAIN.md.
Highlighted Details
Maintenance & Community
The project is associated with ECCV 2024. Further community or maintenance details are not specified in the provided README.
Licensing & Compatibility
The license type and compatibility for commercial or closed-source use are not specified in the provided README.
Limitations & Caveats
The README does not detail specific limitations, known bugs, or the project's maturity level (e.g., alpha/beta status), and compatibility for commercial use is not clarified.
Last updated: 1 year ago · Inactive