Vision LLM research paper for pixel-level understanding, generation, segmentation & editing
Vitron is a unified pixel-level Vision LLM designed to address limitations of existing vision LLMs, such as superficial, instance-level understanding and the lack of unified support for both images and videos. It offers comprehensive capabilities for perception, reasoning, generation, segmentation, and editing across static and dynamic visual content, targeting researchers and developers in computer vision and multimodal AI.
How It Works
Vitron employs a unified architecture that integrates various vision tasks at the pixel level. It leverages a multimodal encoder and projector to bridge vision and language modalities, enabling it to process and generate pixel-level outputs. The design aims for a holistic understanding and manipulation of visual data, moving beyond object-centric or frame-centric approaches.
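As a rough illustration of this encoder-projector-LLM pattern, the sketch below shows a hypothetical projector that maps frozen vision-encoder features into the language model's embedding space before they are concatenated with text embeddings. Module names, dimensions, and the two-layer MLP design are illustrative assumptions, not Vitron's actual implementation.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Hypothetical sketch of the encoder -> projector -> LLM pattern described
    above; names and dimensions are illustrative, not Vitron's actual code."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Two-layer MLP mapping vision features into the LLM embedding space,
        # a common design choice for multimodal projectors.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim) features returned by
        # a frozen image or video encoder.
        return self.proj(vision_feats)

# Usage sketch: project encoder features and prepend them to text embeddings
# before feeding the combined sequence to the language model.
projector = VisionProjector()
vision_feats = torch.randn(1, 256, 1024)   # placeholder encoder output
text_embeds = torch.randn(1, 32, 4096)     # placeholder token embeddings
multimodal_input = torch.cat([projector(vision_feats), text_embeds], dim=1)
print(multimodal_input.shape)              # torch.Size([1, 288, 4096])
```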
Quick Start & Requirements
Install the package in editable mode with `pip install -e .`, and use `pip install -e ".[train]"` for the training extras. Additional installations include flash-attn, decord, opencv-python, and pytorchvideo. Guidance for installing ffmpeg, detectron2, gradio, and deepspeed is provided.
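Consolidated, the installation steps above might look like the following sketch, run from the repository root. The --no-build-isolation flag for flash-attn is an assumption drawn from common practice, so check the repository's instructions for the exact sequence.

```bash
# Editable install of the package, plus optional training extras.
pip install -e .
pip install -e ".[train]"

# Additional dependencies mentioned above.
pip install flash-attn --no-build-isolation   # build flag assumed; verify locally
pip install decord opencv-python pytorchvideo
```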
Maintenance & Community
The project is actively developed by Skywork AI, National University of Singapore, and Nanyang Technological University. Updates are announced via the repository.
Licensing & Compatibility
The majority of the project is released under the Apache 2.0 license. However, the service is intended for non-commercial use only, subject to the LLaMA model license, OpenAI's data terms, and ShareGPT's privacy practices.
Limitations & Caveats
The non-commercial use restriction due to underlying model licenses may limit commercial adoption. The installation process involves several specific dependencies and potential troubleshooting steps, indicating a complex setup.