Vision-language model implementation for UI and infographics understanding
ScreenAI provides a PyTorch implementation of a vision-language model designed for understanding user interfaces and infographics. It is targeted at researchers and developers working with multimodal AI for document analysis and visual content comprehension. The model aims to process both image and text inputs to extract meaningful information from complex visual layouts.
How It Works
The model follows a pipeline that begins with image patching and a Vision Transformer (ViT) for image encoding. Text is processed into embeddings. These image and text representations are then concatenated and passed through attention and feed-forward network layers. A crucial component is the cross-attention mechanism, which allows for interaction between visual and textual modalities, followed by further self-attention and feed-forward layers to produce the final output. This multimodal fusion approach is designed to capture the interplay between visual elements and accompanying text.
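To make the flow above concrete, here is a minimal sketch of the fusion pattern in plain PyTorch: patchify and ViT-encode the image, embed the text, run joint self-attention over the concatenated sequence, then cross-attend from text tokens to image tokens. All module and parameter names are ours for illustration, not ScreenAI's actual internals:

```python
import torch
import torch.nn as nn

class FusionSketch(nn.Module):
    """Illustrative multimodal fusion: ViT-style image encoding, text
    embedding, concatenation, self-attention, then cross-attention."""

    def __init__(self, dim=256, patch=16, img_size=224, vocab=32000, heads=8):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # Image path: non-overlapping patches projected to `dim`
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.img_pos = nn.Parameter(torch.randn(1, n_patches, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.vit = nn.TransformerEncoder(layer, num_layers=4)   # image encoder
        self.tok = nn.Embedding(vocab, dim)                     # text embeddings
        self.fuse = nn.TransformerEncoder(layer, num_layers=2)  # joint self-attention
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim)
        )

    def forward(self, image, tokens):
        # image: (B, 3, H, W); tokens: (B, T) integer token ids
        img = self.patchify(image).flatten(2).transpose(1, 2) + self.img_pos
        img = self.vit(img)                                  # (B, N, D)
        txt = self.tok(tokens)                               # (B, T, D)
        fused = self.fuse(torch.cat([img, txt], dim=1))      # concatenated sequence
        img_f, txt_f = fused[:, : img.size(1)], fused[:, img.size(1):]
        out, _ = self.cross(txt_f, img_f, img_f)             # text attends to image
        return txt_f + self.ffn(out)                         # residual + feed-forward

model = FusionSketch()
y = model(torch.rand(2, 3, 224, 224), torch.randint(0, 32000, (2, 12)))
print(y.shape)  # torch.Size([2, 12, 256])
```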
Quick Start & Requirements
pip3 install screenai
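Once installed, the model can be exercised with dummy tensors. The snippet below follows the usage pattern shown in the project's README at the time of writing; the import path, constructor arguments, and tensor shapes are assumptions that may differ between versions, so treat it as a sketch rather than a guaranteed API:

```python
import torch
from screenai.main import ScreenAI  # import path may vary by version

# Dummy inputs: one RGB image and a short text embedding sequence
image = torch.rand(1, 3, 224, 224)
text = torch.randn(1, 1, 512)

# Constructor arguments are illustrative; consult the repository README
model = ScreenAI(
    patch_size=16,
    image_size=224,
    dim=512,
    depth=6,
    heads=8,
    vit_depth=4,
    multi_modal_encoder_depth=4,
    llm_decoder_depth=4,
    mm_encoder_features=256,
)

out = model(text, image)
print(out.shape)
```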
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project is in an early stage, with a "Todo" list indicating incomplete implementation of key architectural components, such as the nn.ModuleList layer stacks within the encoder and decoder.