OS-Atlas by OS-Copilot

Foundation action model for GUI agents (research paper)

Created 10 months ago
380 stars

Top 75.0% on SourcePulse

Project Summary

OS-Atlas provides foundation models for generalist GUI agents, enabling them to understand and interact with graphical user interfaces. It addresses the need for agents that can precisely locate UI elements based on natural language instructions, outputting coordinates or bounding boxes for interaction.

How It Works

OS-Atlas leverages a dynamic image tiling approach for processing screenshots. Images are divided into a variable number of tiles based on aspect ratio to optimize for different screen layouts. These tiles are then fed into large vision-language models (VLMs) fine-tuned for grounding tasks. This method allows the model to handle diverse image resolutions and aspect ratios effectively while maintaining contextual information for precise localization.
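
The tiling idea above can be sketched in a few lines. This is a minimal illustration, not the project's actual preprocessing code: the function names `choose_tile_grid` and `tile_boxes`, the `max_tiles=12` budget, and the 448-pixel tile size are all assumptions made for the example. It picks the grid whose aspect ratio is closest to the screenshot's, then lists the crop boxes the resized image would be cut into.

```python
from typing import List, Tuple


def choose_tile_grid(width: int, height: int, max_tiles: int = 12) -> Tuple[int, int]:
    """Return (cols, rows) whose ratio cols/rows is closest to width/height.

    Only grids with cols * rows <= max_tiles are considered. On ties the
    grid found first is kept, which is the one with the fewest columns.
    """
    target = width / height
    best = (1, 1)
    best_diff = abs(target - 1.0)
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            diff = abs(target - cols / rows)
            if diff < best_diff:  # strict comparison keeps the earlier grid on ties
                best, best_diff = (cols, rows), diff
    return best


def tile_boxes(cols: int, rows: int, tile_size: int = 448) -> List[Tuple[int, int, int, int]]:
    """List (left, top, right, bottom) crop boxes, row by row.

    Assumes the image was first resized to
    (cols * tile_size) x (rows * tile_size) pixels.
    """
    return [
        (c * tile_size, r * tile_size, (c + 1) * tile_size, (r + 1) * tile_size)
        for r in range(rows)
        for c in range(cols)
    ]


if __name__ == "__main__":
    grid = choose_tile_grid(1920, 1080)  # landscape desktop screenshot
    print(grid, tile_boxes(*grid))
```

Under these assumptions a landscape desktop screenshot and a portrait phone screenshot get different grids, which is the property the paragraph describes: the tile count follows the screen shape instead of forcing every input into one fixed square.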

Quick Start & Requirements

  • OS-Atlas-Base-4B: pip install transformers. Requires torch, torchvision. Inference example provided.
  • OS-Atlas-Base-7B: pip install transformers qwen-vl-utils. Requires torch. Inference example provided.
  • Both models accept images of any size. Output coordinates are normalized to a 0-1000 range.
  • Models are available on Hugging Face: OS-Copilot/OS-Atlas-Base-4B, OS-Copilot/OS-Atlas-Base-7B.
  • Homepage, Paper.
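
Because outputs are normalized to a 0-1000 range, they usually need to be mapped back to screen pixels before an agent can click. The sketch below shows one way to do that; `to_pixels` and `center` are hypothetical helper names, and the box is assumed to arrive as four numbers in (x1, y1, x2, y2) order, which the summary above does not specify.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]


def to_pixels(box: Tuple[float, float, float, float],
              width: int, height: int, scale: int = 1000) -> Box:
    """Map a box from the normalized 0..scale range to pixel coordinates.

    Assumes (x1, y1, x2, y2) ordering; x values scale with the image
    width and y values with the image height.
    """
    x1, y1, x2, y2 = box
    return (
        round(x1 * width / scale),
        round(y1 * height / scale),
        round(x2 * width / scale),
        round(y2 * height / scale),
    )


def center(box: Box) -> Tuple[int, int]:
    """Return the integer midpoint of a pixel box, e.g. as a click target."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) // 2, (y1 + y2) // 2)


if __name__ == "__main__":
    # A normalized box on a 1920x1080 screenshot.
    pixel_box = to_pixels((250, 500, 750, 1000), width=1920, height=1080)
    print(pixel_box, center(pixel_box))
```

Keeping the scale as a parameter makes the assumption explicit and lets the same helper be reused if a model variant reports coordinates on a different range.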

Highlighted Details

  • Foundation Action Model for Generalist GUI Agents.
  • Outputs normalized coordinates or bounding boxes for UI element interaction.
  • Supports dynamic image tiling for flexible input resolution handling.
  • Models fine-tuned from InternVL2-4B and Qwen2-VL-7B-Instruct.

Maintenance & Community

  • Paper accepted by ICLR 2025.
  • Models and data available on Hugging Face.

Licensing & Compatibility

  • License details are not explicitly stated in the README.

Limitations & Caveats

  • The README does not specify a license, so the terms for commercial use or integration into closed-source projects are unclear.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 12 stars in the last 30 days
