OS-Atlas by OS-Copilot

Foundation action model for GUI agents (research paper)

Created 1 year ago
446 stars

Top 66.6% on SourcePulse

Project Summary

OS-Atlas provides foundation models for generalist GUI agents, enabling them to understand and interact with graphical user interfaces. It addresses the need for agents that can precisely locate UI elements based on natural language instructions, outputting coordinates or bounding boxes for interaction.

How It Works

OS-Atlas leverages a dynamic image tiling approach for processing screenshots. Images are divided into a variable number of tiles based on aspect ratio to optimize for different screen layouts. These tiles are then fed into large vision-language models (VLMs) fine-tuned for grounding tasks. This method allows the model to handle diverse image resolutions and aspect ratios effectively while maintaining contextual information for precise localization.
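The tiling step described above can be illustrated with a small sketch. This is not the OS-Atlas implementation: the function names (choose_tile_grid, tile_boxes), the max_tiles default of 6, and the tie-breaking rule are all assumptions made for illustration. The idea is to enumerate candidate grids, pick the one whose aspect ratio is closest to the screenshot's, and derive the crop box for each tile.

```python
from typing import List, Tuple


def choose_tile_grid(width: int, height: int, max_tiles: int = 6) -> Tuple[int, int]:
    """Pick the (cols, rows) grid whose aspect ratio best matches the image.

    Hypothetical helper: max_tiles and the tie-break are illustrative choices.
    """
    target = width / height
    candidates = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if cols * rows <= max_tiles
    ]
    # Closest aspect ratio wins; on a tie, prefer the grid with more tiles.
    return min(
        candidates,
        key=lambda g: (abs(target - g[0] / g[1]), -(g[0] * g[1])),
    )


def tile_boxes(width: int, height: int, max_tiles: int = 6) -> List[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) pixel boxes, row by row."""
    cols, rows = choose_tile_grid(width, height, max_tiles)
    tile_w, tile_h = width / cols, height / rows
    return [
        (
            round(c * tile_w),
            round(r * tile_h),
            round((c + 1) * tile_w),
            round((r + 1) * tile_h),
        )
        for r in range(rows)
        for c in range(cols)
    ]


print(choose_tile_grid(1920, 1080))  # (2, 1) for a landscape screenshot
print(choose_tile_grid(1080, 1920))  # (1, 2) for a portrait screenshot
print(tile_boxes(1920, 1080))        # [(0, 0, 960, 1080), (960, 0, 1920, 1080)]
```

Each box could then be passed to an image library's crop call before the tiles are handed to the model. A landscape desktop screenshot and a portrait phone screenshot end up with differently shaped grids, which is the point of choosing the grid from the aspect ratio.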

Quick Start & Requirements

  • OS-Atlas-Base-4B: install with pip install transformers; requires torch and torchvision. The README provides an inference example.
  • OS-Atlas-Base-7B: install with pip install transformers qwen-vl-utils; requires torch. The README provides an inference example.
  • Both models accept images of any size. Output coordinates are normalized to a 0-1000 range, so they must be rescaled to the screenshot's pixel dimensions before use.
  • Models are available on Hugging Face: OS-Copilot/OS-Atlas-Base-4B, OS-Copilot/OS-Atlas-Base-7B.
  • Homepage, Paper.
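Because the outputs are normalized to a 0-1000 range, a caller has to map them back to pixels before clicking. The following is a minimal sketch of that conversion, assuming the box has already been parsed from the model's text output into four numbers (x1, y1, x2, y2). The helper names denormalize_box and box_center are hypothetical and are not part of the OS-Atlas code.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]


def denormalize_box(box: Tuple[float, float, float, float],
                    width: int, height: int, scale: int = 1000) -> Box:
    """Map a box in the normalized 0-scale range to pixel coordinates.

    Hypothetical helper; assumes (x1, y1, x2, y2) ordering.
    """
    x1, y1, x2, y2 = box
    return (
        round(x1 * width / scale),
        round(y1 * height / scale),
        round(x2 * width / scale),
        round(y2 * height / scale),
    )


def box_center(box: Box) -> Tuple[int, int]:
    """Return the integer center of a pixel box, e.g. as a click target."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) // 2, (y1 + y2) // 2)


# A normalized box on a 1920x1080 screenshot.
pixel_box = denormalize_box((500, 250, 750, 500), width=1920, height=1080)
print(pixel_box)              # (960, 270, 1440, 540)
print(box_center(pixel_box))  # (1200, 405)
```

The width and height passed in should be those of the original screenshot, not of any resized copy, so that the resulting point lands on the real screen.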

Highlighted Details

  • Foundation Action Model for Generalist GUI Agents.
  • Outputs normalized coordinates or bounding boxes for UI element interaction.
  • Supports dynamic image tiling for flexible input resolution handling.
  • Models fine-tuned from InternVL2-4B and Qwen2-VL-7B-Instruct.

Maintenance & Community

  • Paper accepted by ICLR 2025.
  • Models and data available on Hugging Face.

Licensing & Compatibility

  • License details are not explicitly stated in the README.

Limitations & Caveats

  • The README does not specify a license. This leaves usage rights unclear for commercial use or integration into closed-source projects.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 5 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Max Liu (Cofounder of PingCAP), and 2 more.

ShowUI by showlab

0.4%
2k
Vision-language-action model for GUI agent & computer use (CVPR 2025 paper)
Created 1 year ago
Updated 1 month ago
Starred by Alex Yu (Research Scientist at OpenAI; Cofounder of Luma AI), Elvis Saravia (Founder of DAIR.AI), and 7 more.

CogVLM by zai-org

0.0%
7k
VLM for image understanding and multi-turn dialogue
Created 2 years ago
Updated 2 years ago