Foundation action model for GUI agents (research paper)
OS-Atlas provides foundation models for generalist GUI agents, enabling them to understand and interact with graphical user interfaces. It addresses the need for agents that can precisely locate UI elements based on natural language instructions, outputting coordinates or bounding boxes for interaction.
How It Works
OS-Atlas leverages a dynamic image tiling approach for processing screenshots. Images are divided into a variable number of tiles based on aspect ratio to optimize for different screen layouts. These tiles are then fed into large vision-language models (VLMs) fine-tuned for grounding tasks. This method allows the model to handle diverse image resolutions and aspect ratios effectively while maintaining contextual information for precise localization.
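The tiling step described above can be sketched roughly as follows. This is an illustrative reconstruction, not the OS-Atlas implementation: the function names `choose_grid` and `tile_boxes`, the 448-pixel tile size, the 12-tile budget, and the tie-breaking rule are all assumptions made for the example.

```python
from typing import List, Tuple


def choose_grid(width: int, height: int, max_tiles: int = 12) -> Tuple[int, int]:
    """Pick a (cols, rows) grid whose aspect ratio best matches the screenshot.

    max_tiles is a hypothetical tile budget; the real models may use another value.
    """
    aspect = width / height
    # Every grid shape that fits inside the tile budget.
    candidates = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if cols * rows <= max_tiles
    ]
    # Closest aspect ratio wins; ties go to the grid with more tiles (assumed rule),
    # so large screenshots keep more detail.
    return min(
        candidates,
        key=lambda g: (abs(aspect - g[0] / g[1]), -(g[0] * g[1])),
    )


def tile_boxes(cols: int, rows: int, tile: int = 448) -> List[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) crop boxes in row-major order.

    Assumes the screenshot was first resized to (cols * tile, rows * tile).
    """
    return [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
        for r in range(rows)
        for c in range(cols)
    ]


if __name__ == "__main__":
    cols, rows = choose_grid(1920, 1080)
    print(cols, rows, len(tile_boxes(cols, rows)))
```

Under these assumptions a landscape 1920x1080 screenshot gets a wide grid and a portrait 1080x1920 screenshot gets a tall one, which is the sense in which the tile count varies with aspect ratio. Each crop box would then be cut from the resized image and passed to the VLM.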
Quick Start & Requirements
pip install transformers
Requires torch and torchvision. Inference example provided.
pip install transformers qwen-vl-utils
Requires torch. Inference example provided.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
5 months ago
Inactive