ScreenAI by kyegomez

Vision-language model implementation for UI and infographics understanding

created 1 year ago
354 stars

Top 79.9% on sourcepulse

Project Summary

ScreenAI provides a PyTorch implementation of a vision-language model designed for understanding user interfaces and infographics. It is targeted at researchers and developers working with multimodal AI for document analysis and visual content comprehension. The model aims to process both image and text inputs to extract meaningful information from complex visual layouts.

How It Works

The pipeline begins by splitting the image into patches and encoding them with a Vision Transformer (ViT), while the text is converted into embeddings. The image and text representations are then concatenated and passed through attention and feed-forward layers. A cross-attention mechanism lets the visual and textual modalities interact, and further self-attention and feed-forward layers produce the final output. This multimodal fusion is designed to capture the interplay between visual elements and their accompanying text.
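The stages described above can be sketched in plain PyTorch. This is a minimal illustration, not the project's code: the class name FusionSketch, the layer sizes, the single-layer depth of each stage, and the choice of which tokens act as queries in the cross-attention step are all assumptions made for the example, and the text input is assumed to be embedded already.

```python
import torch
import torch.nn as nn


class FusionSketch(nn.Module):
    """Hypothetical sketch: patchify -> ViT-style encoder -> concat with text
    -> attention/FFN -> cross-attention -> attention/FFN."""

    def __init__(self, image_size=224, patch_size=16, dim=64, heads=4):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # A strided convolution splits the image into patches and projects
        # each patch to `dim` channels in one step.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        # Each encoder layer bundles self-attention and a feed-forward network.
        self.vit = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=dim * 4, batch_first=True
        )
        self.fuse = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=dim * 4, batch_first=True
        )
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.out = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=dim * 4, batch_first=True
        )

    def forward(self, text, image):
        # (B, 3, H, W) -> (B, dim, H/P, W/P) -> (B, num_patches, dim)
        patches = self.patch_embed(image).flatten(2).transpose(1, 2)
        img = self.vit(patches + self.pos_embed)
        # Concatenate image tokens and text embeddings along the sequence axis.
        x = self.fuse(torch.cat([img, text], dim=1))
        # Assumed direction: the fused sequence queries the image tokens.
        attended, _ = self.cross_attn(query=x, key=img, value=img)
        x = self.norm(x + attended)
        return self.out(x)


model = FusionSketch().eval()
image = torch.rand(1, 3, 224, 224)   # one RGB image
text = torch.randn(1, 10, 64)        # 10 pre-embedded text tokens of width 64
with torch.no_grad():
    out = model(text, image)
print(out.shape)  # torch.Size([1, 206, 64]): 196 image patches + 10 text tokens
```

A 224x224 image with 16x16 patches yields 14 x 14 = 196 patch tokens, so the fused sequence here has 206 positions.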

Quick Start & Requirements

  • Install via pip: pip3 install screenai
  • Requires PyTorch.
  • The README's usage example creates the model and runs it on random tensors.

Highlighted Details

  • Implements the ScreenAI model from the paper "A Vision-Language Model for UI and Infographics Understanding".
  • Processes image and text inputs through a structured pipeline involving ViT, attention, and feed-forward networks.
  • Includes a bibtex entry for citation.

Maintenance & Community

  • The project is maintained by kyegomez.
  • A "Todo" section indicates planned implementation of nn.ModuleList in the encoder and decoder.

Licensing & Compatibility

  • Licensed under the MIT License.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The project is at an early stage: its "Todo" list notes that nn.ModuleList has not yet been implemented in the encoder and decoder.
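For context on why that item matters, the following sketch shows what nn.ModuleList provides in general. It is not taken from the project, and the source does not say how the encoder and decoder layers are currently stored; the class names and sizes are hypothetical.

```python
import torch
import torch.nn as nn


class PlainListStack(nn.Module):
    """Layers kept in a plain Python list (hypothetical counter-example)."""

    def __init__(self, dim=8, depth=3):
        super().__init__()
        # PyTorch does not look inside plain lists, so these layers'
        # parameters are not registered on the parent module.
        self.layers = [nn.Linear(dim, dim) for _ in range(depth)]


class ModuleListStack(nn.Module):
    """Same layers, stored in nn.ModuleList so they are registered."""

    def __init__(self, dim=8, depth=3):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


def count_params(module):
    # Total number of scalar parameters visible to optimizers, .to(), etc.
    return sum(p.numel() for p in module.parameters())


print(count_params(PlainListStack()))   # 0
print(count_params(ModuleListStack()))  # 216 = 3 layers x (8*8 weights + 8 biases)
```

Parameters that are not registered are skipped by optimizers, device moves, and state_dict, which is why stacked layers are normally held in nn.ModuleList.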

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

  • 17 stars in the last 90 days
