ScreenAI by kyegomez

Vision-language model implementation for UI and infographics understanding

Created 1 year ago
365 stars

Top 77.1% on SourcePulse

View on GitHub
Project Summary

ScreenAI provides a PyTorch implementation of a vision-language model designed for understanding user interfaces and infographics. It is targeted at researchers and developers working with multimodal AI for document analysis and visual content comprehension. The model aims to process both image and text inputs to extract meaningful information from complex visual layouts.

How It Works

The model follows a pipeline that begins with image patching and a Vision Transformer (ViT) for image encoding, while text is processed into embeddings. The image and text representations are then concatenated and passed through attention and feed-forward network layers. The crucial component is a cross-attention mechanism that lets the visual and textual modalities interact; further self-attention and feed-forward layers then produce the final output. This multimodal fusion approach is designed to capture the interplay between visual elements and accompanying text.
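
As a concrete illustration, here is a minimal PyTorch sketch of that fusion step. This is not the repository's code: every module, shape, and dimension choice below is an assumption.

```python
# Minimal sketch of the described fusion pipeline -- illustrative only,
# not the repository's implementation.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Text tokens attend to ViT-style image patches via cross-attention,
    then the fused sequence is refined with self-attention and a
    feed-forward network, mirroring the pipeline described above."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, text_tokens, image_tokens):
        # Cross-attention: text queries attend to image patch keys/values.
        fused, _ = self.cross_attn(text_tokens, image_tokens, image_tokens)
        x = self.norm1(text_tokens + fused)
        # Self-attention over the fused sequence.
        refined, _ = self.self_attn(x, x, x)
        x = self.norm2(x + refined)
        return x + self.ffn(x)


# Toy inputs: 64 image patches and 16 text tokens, embedding dim 256.
image_tokens = torch.randn(1, 64, 256)
text_tokens = torch.randn(1, 16, 256)
out = CrossAttentionFusion()(text_tokens, image_tokens)
print(out.shape)  # torch.Size([1, 16, 256])
```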

Quick Start & Requirements

  • Install via pip: pip3 install screenai
  • Requires PyTorch.
  • Usage example provided in the README demonstrates creating and running the model with random tensors; a hedged sketch of that pattern follows this list.
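
The sketch below approximates that README pattern; the import path, class name, constructor arguments, and tensor shapes here are all assumptions, so consult the repository's README for the exact interface.

```python
# Hedged sketch of README-style usage. Import path, class name, constructor
# signature, and tensor shapes are assumptions -- verify against the README.
import torch

from screenai.main import ScreenAI  # assumed import path

# Random stand-ins for one image and one text sequence.
image = torch.rand(1, 3, 224, 224)  # (batch, channels, height, width)
text = torch.randn(1, 1, 512)       # assumed (batch, seq len, embed dim)

model = ScreenAI()  # assumed defaults; the real constructor likely takes hyperparameters
out = model(text, image)
print(out.shape)
```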

Highlighted Details

  • Implements the ScreenAI model from the paper "A Vision-Language Model for UI and Infographics Understanding".
  • Processes image and text inputs through a structured pipeline involving ViT, attention, and feed-forward networks.
  • Includes a bibtex entry for citation.

Maintenance & Community

  • The project is maintained by kyegomez.
  • A "Todo" section indicates planned adoption of nn.ModuleList in the encoder and decoder; a generic sketch of that pattern follows this list.
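
For context, the refactor mentioned in that "Todo" item typically follows the standard PyTorch pattern sketched below; this is generic library usage, not code from the repository.

```python
# Generic PyTorch pattern for stacking layers with nn.ModuleList --
# not code from this repository.
import torch
import torch.nn as nn


class Encoder(nn.Module):
    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        # nn.ModuleList registers each layer's parameters with the parent
        # module, unlike a plain Python list.
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, batch_first=True)
             for _ in range(depth)]
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


print(Encoder()(torch.randn(1, 16, 256)).shape)  # torch.Size([1, 16, 256])
```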

Licensing & Compatibility

  • Licensed under the MIT License.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The project is in an early stage: its "Todo" list indicates that core pieces, such as refactoring the encoder and decoder to use nn.ModuleList, remain unimplemented.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 3
  • Issues (30d): 1

Star History

11 stars in the last 30 days

Explore Similar Projects

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Travis Fischer (Founder of Agentic), and 5 more.

fromage by kohjingyu

482 stars
Multimodal model for grounding language models to images
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

gill by kohjingyu

463 stars
Multimodal LLM for generating/retrieving images and generating text
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI).

NExT-GPT by NExT-GPT

4k stars
Any-to-any multimodal LLM research paper
Created 2 years ago
Updated 4 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI).

DeepSeek-VL2 by deepseek-ai

5k stars
MoE vision-language model for multimodal understanding
Created 9 months ago
Updated 6 months ago