UI-TARS is an open-source multimodal agent for complex tasks in virtual environments, aimed at researchers and developers working on AI agents and GUI automation. It combines advanced reasoning with adaptability and reports state-of-the-art results on benchmarks such as OSWorld and ScreenSpot-V2.
How It Works
UI-TARS-1.5 is a vision-language model agent that uses reinforcement learning to reason before it acts ("think before acting"). This reasoning step improves performance and adaptability, especially under inference-time scaling, and helps the agent handle diverse virtual-world interactions.
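To make the "think before acting" idea concrete, the following is a minimal sketch of how a caller might separate a reasoning step from the action that follows it. The `Thought:` / `Action:` response layout, the `Step` type, and the `parse_step` helper are assumptions made for illustration; they are not taken from the project's documentation, so check the official guides for the real output format.

```python
import re
from typing import NamedTuple


class Step(NamedTuple):
    """One agent turn: the reasoning text and the action it leads to."""
    thought: str
    action: str


# Assumed (hypothetical) layout: a "Thought:" section followed by an "Action:" section.
_STEP_RE = re.compile(
    r"Thought:\s*(?P<thought>.*?)\s*Action:\s*(?P<action>.*)",
    re.DOTALL,
)


def parse_step(response: str) -> Step:
    """Split a model response into its reasoning and its action string."""
    match = _STEP_RE.search(response)
    if match is None:
        raise ValueError("response does not contain both 'Thought:' and 'Action:'")
    return Step(match.group("thought").strip(), match.group("action").strip())


if __name__ == "__main__":
    example = "Thought: The dialog is open, so I should confirm.\nAction: click(point='(120, 340)')"
    step = parse_step(example)
    print(step.thought)
    print(step.action)
```

Keeping the thought and the action as separate fields lets a caller log or inspect the reasoning without passing it to whatever executes the action.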
Quick Start & Requirements
- The README links to the official quick-start and deployment guides, along with details on coordinate processing.
- Requires significant computational resources.
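Coordinate processing for a GUI agent typically means mapping a point predicted on the image the model saw back to pixels on the real screen. The sketch below shows one common way to do that with simple proportional scaling. It assumes the model's coordinates are absolute pixels in a possibly resized screenshot; the function name `to_screen_coords` and its parameters are hypothetical, and the project's own guide should be treated as the authority on the actual procedure.

```python
from typing import Tuple


def to_screen_coords(
    x: float,
    y: float,
    model_size: Tuple[int, int],
    screen_size: Tuple[int, int],
) -> Tuple[int, int]:
    """Scale a point from model-image space to screen pixels (assumed linear mapping)."""
    model_w, model_h = model_size
    screen_w, screen_h = screen_size
    if model_w <= 0 or model_h <= 0 or screen_w <= 0 or screen_h <= 0:
        raise ValueError("image and screen dimensions must be positive")

    # Proportional scaling along each axis.
    sx = round(x * screen_w / model_w)
    sy = round(y * screen_h / model_h)

    # Clamp so the result always lands on a valid pixel.
    sx = min(max(sx, 0), screen_w - 1)
    sy = min(max(sy, 0), screen_h - 1)
    return sx, sy


if __name__ == "__main__":
    # A point at the centre of a 1000x500 model image maps to the centre of a 1920x1080 screen.
    print(to_screen_coords(500, 250, (1000, 500), (1920, 1080)))
```

Clamping guards against predictions that fall on or just outside the image border, which would otherwise produce an off-screen click.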
Highlighted Details
- Achieves state-of-the-art results on benchmarks like OSWorld (42.5), Windows Agent Arena (42.1), and ScreenSpot-V2 (94.2).
- Demonstrates perfect scores (100%) across multiple Poki games, outperforming competitors like OpenAI CUA and Claude 3.7.
- Reports Minecraft results for "UI-TARS-1.5 w/ Thought": an average of 0.42 on block-mining tasks and 0.31 on mob-killing tasks.
- Compares several model scales, including UI-TARS-72B-DPO, UI-TARS-1.5-7B, and the full UI-TARS-1.5 model.
Maintenance & Community
- The project is actively developed by ByteDance.
- A Discord server is available for community interaction.
- Contact information for research access is provided.
Licensing & Compatibility
- The project is open-source and provides a citation for academic use. The README does not state the license explicitly, so consult the repository for the exact terms.
Limitations & Caveats
- Potential for misuse due to advanced GUI task capabilities, including CAPTCHA navigation.
- Requires substantial computational resources.
- May exhibit hallucination, misidentification of GUI elements, or suboptimal actions in ambiguous environments.
- The 7B model is not specifically optimized for game scenarios, where the larger UI-TARS-1.5 model excels.