OS-Atlas by OS-Copilot

Foundation action model for GUI agents (research paper)

Created 1 year ago

436 stars

Top 68.4% on SourcePulse

Project Summary

OS-Atlas provides foundation models for generalist GUI agents, enabling them to understand and interact with graphical user interfaces. It addresses the need for agents that can precisely locate UI elements based on natural language instructions, outputting coordinates or bounding boxes for interaction.

How It Works

OS-Atlas leverages a dynamic image tiling approach for processing screenshots. Images are divided into a variable number of tiles based on aspect ratio to optimize for different screen layouts. These tiles are then fed into large vision-language models (VLMs) fine-tuned for grounding tasks. This method allows the model to handle diverse image resolutions and aspect ratios effectively while maintaining contextual information for precise localization.
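As a rough illustration of aspect-ratio-based tiling, the sketch below picks a tile grid whose shape best matches a screenshot and lists the crop boxes for each tile. It is not taken from the OS-Atlas code: the function names, the 448-pixel tile size, and the 12-tile cap are assumptions chosen for illustration only.

```python
from typing import List, Tuple

Grid = Tuple[int, int]            # (cols, rows)
Box = Tuple[int, int, int, int]   # (left, top, right, bottom)


def choose_tile_grid(width: int, height: int, max_tiles: int = 12) -> Grid:
    """Pick the (cols, rows) grid whose aspect ratio is closest to the image's.

    max_tiles is a hypothetical budget, not a documented OS-Atlas setting.
    """
    target = width / height
    candidates = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if cols * rows <= max_tiles
    ]
    # Closest aspect ratio wins; ties go to the grid with more tiles,
    # which preserves more detail from the screenshot.
    return min(
        candidates,
        key=lambda g: (abs(target - g[0] / g[1]), -(g[0] * g[1])),
    )


def tile_boxes(cols: int, rows: int, tile_size: int = 448) -> List[Box]:
    """Crop boxes, row by row, for an image resized to
    (cols * tile_size) x (rows * tile_size). tile_size is an assumed value."""
    return [
        (c * tile_size, r * tile_size, (c + 1) * tile_size, (r + 1) * tile_size)
        for r in range(rows)
        for c in range(cols)
    ]


if __name__ == "__main__":
    grid = choose_tile_grid(1920, 1080)   # landscape desktop screenshot
    print(grid, len(tile_boxes(*grid)))
    print(choose_tile_grid(1080, 1920))   # portrait mobile screenshot
```

Under these assumptions a landscape screenshot gets a wide grid and a portrait screenshot a tall one, so neither is squeezed into a single fixed square input.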

Quick Start & Requirements

OS-Atlas-Base-4B: pip install transformers. Also requires torch and torchvision. An inference example is provided.
OS-Atlas-Base-7B: pip install transformers qwen-vl-utils. Also requires torch. An inference example is provided.
Both models accept images of any size. Output coordinates are normalized to a 0-1000 range.
Models are available on Hugging Face: OS-Copilot/OS-Atlas-Base-4B, OS-Copilot/OS-Atlas-Base-7B.
Homepage, Paper.
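Because outputs are normalized to a 0-1000 range, they have to be scaled back to the screenshot's pixel size before an agent can click. The sketch below shows one way to do that. It assumes a bounding box in (x1, y1, x2, y2) order, and the helper names to_pixels and click_point are hypothetical; parsing the model's raw text output is not covered.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)


def to_pixels(box: Box, width: int, height: int, scale: int = 1000) -> Box:
    """Map a box from the normalized 0-1000 range to pixel coordinates."""
    x1, y1, x2, y2 = box
    return (
        round(x1 * width / scale),
        round(y1 * height / scale),
        round(x2 * width / scale),
        round(y2 * height / scale),
    )


def click_point(pixel_box: Box) -> Tuple[int, int]:
    """Center of a pixel box, a common choice for where to click."""
    x1, y1, x2, y2 = pixel_box
    return ((x1 + x2) // 2, (y1 + y2) // 2)


if __name__ == "__main__":
    # Hypothetical model output for a 1920x1080 screenshot.
    pixel_box = to_pixels((250, 500, 300, 550), width=1920, height=1080)
    print(pixel_box)               # (480, 540, 576, 594)
    print(click_point(pixel_box))  # (528, 567)
```

Scaling against the original screenshot size, rather than any resized copy fed to the model, keeps the click aligned with the real screen.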

Highlighted Details

Foundation Action Model for Generalist GUI Agents.
Outputs normalized coordinates or bounding boxes for UI element interaction.
Supports dynamic image tiling for flexible input resolution handling.
Models fine-tuned from InternVL2-4B and Qwen2-VL-7B-Instruct.

Maintenance & Community

Paper accepted by ICLR 2025.
Models and data available on Hugging Face.

Licensing & Compatibility

License details are not explicitly stated in the README.

Limitations & Caveats

Because the README does not specify a license, the terms for commercial use or integration into closed-source projects are unclear.

Health Check

Last commit: 10 months ago
Responsiveness: Inactive
Pull requests (30d): 0
Issues (30d): 0
Star history: 6 stars in the last 30 days

