Vitron by SkyworkAI

Vision LLM research paper for pixel-level understanding, generation, segmentation & editing

created 1 year ago
558 stars

Top 58.3% on sourcepulse

Project Summary

Vitron is a unified pixel-level vision LLM designed to address limitations of existing vision LLMs, such as superficial instance-level understanding and a lack of unified support for both images and videos. It covers perception, reasoning, generation, segmentation, and editing across static and dynamic visual content, and is aimed at researchers and developers in computer vision and multimodal AI.

How It Works

Vitron employs a unified architecture that integrates various vision tasks at the pixel level. It leverages a multimodal encoder and projector to bridge vision and language modalities, enabling it to process and generate pixel-level outputs. The design aims for a holistic understanding and manipulation of visual data, moving beyond object-centric or frame-centric approaches.

Quick Start & Requirements

  • Installation: Clone the repository, activate a Python 3.10 conda environment, and install dependencies using pip install -e . and pip install -e ".[train]". Additional installations include flash-attn, decord, opencv-python, and pytorchvideo.
  • Prerequisites: Python >= 3.8, PyTorch == 2.1.0, CUDA >= 11.8. Specific installation troubleshooting for ffmpeg, detectron2, gradio, and deepspeed is provided.
  • Resources: No hardware figures are stated. The CUDA >= 11.8 and flash-attn requirements imply a CUDA-capable NVIDIA GPU for training and, most likely, for inference as well.
  • Links: Repository, Dataset
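
The prerequisites above can be checked before starting the install. The following is a minimal sketch, not part of the Vitron repository: the function names (parse_version, check_requirements, detect_versions) are hypothetical, and the thresholds are taken only from the versions listed in this summary. It assumes that PyTorch, if installed, exposes its CUDA build version as torch.version.cuda.

```python
import importlib
import sys


def parse_version(text):
    """Turn a version string such as '2.1.0+cu118' into a tuple like (2, 1, 0)."""
    core = text.split("+")[0]  # drop local build suffixes such as '+cu118'
    numbers = []
    for piece in core.split("."):
        leading = ""
        for ch in piece:
            if not ch.isdigit():
                break
            leading += ch
        if not leading:
            break
        numbers.append(int(leading))
    return tuple(numbers)


def check_requirements(python_version, torch_version, cuda_version):
    """Return a list of problems; an empty list means every check passed.

    Thresholds mirror this summary: Python >= 3.8, PyTorch == 2.1.0, CUDA >= 11.8.
    """
    problems = []
    if parse_version(python_version) < (3, 8):
        problems.append(f"Python {python_version} is older than 3.8")
    if torch_version is None:
        problems.append("PyTorch is not installed")
    elif parse_version(torch_version)[:3] != (2, 1, 0):
        problems.append(f"PyTorch {torch_version} is not 2.1.0")
    if cuda_version is None:
        problems.append("no CUDA build of PyTorch detected")
    elif parse_version(cuda_version) < (11, 8):
        problems.append(f"CUDA {cuda_version} is older than 11.8")
    return problems


def detect_versions():
    """Read the versions from the current environment (torch may be absent)."""
    python_version = ".".join(str(n) for n in sys.version_info[:3])
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return python_version, None, None
    # torch.version.cuda is None for CPU-only builds.
    return python_version, str(torch.__version__), torch.version.cuda
```

Calling check_requirements(*detect_versions()) inside the activated conda environment returns the list of mismatches, if any. The check is deliberately strict about the PyTorch version because the summary pins it with ==; loosen it if the repository's own instructions allow other versions.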

Highlighted Details

  • Accepted to NeurIPS 2024.
  • Releases checkpoints and a dataset for Text Invocation Instruction Tuning.
  • Supports image and video understanding, generation, segmentation, and editing.
  • Integrates modules from related projects like GLIGEN, SEEM, i2vgen-xl, and StableVideo.

Maintenance & Community

The project is developed by Skywork AI, the National University of Singapore, and Nanyang Technological University. Updates are announced via the repository.

Licensing & Compatibility

The majority of the project is released under the Apache 2.0 license. However, the service is intended for non-commercial use only, subject to the LLaMA model license, OpenAI's data terms, and ShareGPT's privacy practices.

Limitations & Caveats

The non-commercial use restriction inherited from the underlying model and data licenses may limit commercial adoption. Setup is involved: it pins specific dependency versions and may require the troubleshooting steps for ffmpeg, detectron2, gradio, and deepspeed.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 32 stars in the last 90 days
