UI-TARS  by bytedance

Multimodal agent for GUI interaction in virtual worlds (research paper)

created 6 months ago
6,784 stars

Top 7.7% on sourcepulse

Project Summary

UI-TARS is an open-source multimodal agent for carrying out complex tasks in GUI and other virtual environments. It targets researchers and developers working on AI agents and GUI automation. The project emphasizes reasoning and adaptability, and reports state-of-the-art results on benchmarks such as OSWorld, Windows Agent Arena, and ScreenSpot-V2.

How It Works

UI-TARS-1.5 is a vision-language model agent trained with reinforcement learning to reason explicitly before it acts ("think before acting"). According to the project, this reasoning step improves performance and adaptability, lets the model benefit from inference-time scaling, and helps it handle diverse virtual-world interactions.
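
As a rough illustration of the "think before acting" pattern, the sketch below splits one model response into its reasoning and its action. The `Thought:` / `Action:` text layout, the `Step` type, and the `parse_step` helper are assumptions made for this example; they are not taken from the UI-TARS README, so check the official documentation for the real output format.

```python
import re
from typing import NamedTuple


class Step(NamedTuple):
    """One agent step: free-form reasoning plus the action string to execute."""
    thought: str
    action: str


# Hypothetical layout: "Thought: <reasoning>" followed by "Action: <call>".
_STEP_PATTERN = re.compile(
    r"Thought:\s*(?P<thought>.*?)\s*Action:\s*(?P<action>.*)",
    re.DOTALL,
)


def parse_step(response: str) -> Step:
    """Split a model response into its thought and action parts."""
    match = _STEP_PATTERN.search(response)
    if match is None:
        raise ValueError("response does not contain 'Thought:' and 'Action:'")
    return Step(match.group("thought").strip(), match.group("action").strip())


if __name__ == "__main__":
    reply = "Thought: The Save button is in the toolbar.\nAction: click(x=10, y=20)"
    step = parse_step(reply)
    print(step.thought)
    print(step.action)
```

Keeping the thought separate from the action makes it possible to log the reasoning for debugging while passing only the action string to whatever executes it.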

Quick Start & Requirements

  • The README links to the official quick-start and deployment guides, which also cover coordinate processing.
  • Requires significant computational resources.
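
The summary mentions coordinate processing without describing it. A common need with GUI agents is mapping a point predicted on a resized screenshot back to real screen pixels. The sketch below shows that proportional rescaling under the assumption that the model returns absolute pixel coordinates in the resized image. The function name `to_screen_coords` and its parameters are hypothetical, and the official guide may prescribe a different scheme.

```python
from typing import Tuple


def to_screen_coords(
    x_model: float,
    y_model: float,
    model_size: Tuple[int, int],
    screen_size: Tuple[int, int],
) -> Tuple[int, int]:
    """Map a point from the model's input-image space to screen pixels.

    Assumes (x_model, y_model) are absolute pixel coordinates in the resized
    screenshot given to the model, with model_size = (width, height).
    """
    model_w, model_h = model_size
    screen_w, screen_h = screen_size
    if model_w <= 0 or model_h <= 0:
        raise ValueError("model_size must have positive width and height")

    # Scale each axis independently by the ratio of screen size to model size.
    x = round(x_model * screen_w / model_w)
    y = round(y_model * screen_h / model_h)

    # Clamp so the result is always a valid pixel index on the screen.
    x = min(max(x, 0), screen_w - 1)
    y = min(max(y, 0), screen_h - 1)
    return x, y


if __name__ == "__main__":
    # Centre of a 1000x500 model image on a 1920x1080 screen.
    print(to_screen_coords(500, 250, (1000, 500), (1920, 1080)))
```

Clamping is a defensive choice for this sketch: it keeps a slightly out-of-range prediction from producing a click outside the screen.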

Highlighted Details

  • Achieves state-of-the-art results on benchmarks like OSWorld (42.5), Windows Agent Arena (42.1), and ScreenSpot-V2 (94.2).
  • Demonstrates perfect scores (100%) across multiple Poki games, outperforming competitors like OpenAI CUA and Claude 3.7.
  • Shows improved performance on Minecraft tasks, with "UI-TARS-1.5 w/ Thought" achieving an average of 0.42 for mining blocks and 0.31 for killing mobs.
  • Offers different model scales, with UI-TARS-1.5 (7B) and UI-TARS-1.5-72B-DPO variants available for comparison.

Maintenance & Community

  • The project is actively developed by ByteDance.
  • A Discord server is available for community interaction.
  • Contact information for research access is provided.

Licensing & Compatibility

  • The project is open-source and provides a citation for academic use.
  • The README does not explicitly state a license; typical open-source usage is implied.

Limitations & Caveats

  • Potential for misuse due to advanced GUI task capabilities, including CAPTCHA navigation.
  • Requires substantial computational resources.
  • May exhibit hallucination, misidentification of GUI elements, or suboptimal actions in ambiguous environments.
  • The 7B model is not specifically optimized for game scenarios, where the larger UI-TARS-1.5 model excels.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 20

Star History

  • 1,228 stars in the last 90 days
