Multimodal reasoning and code execution for complex visual tasks
Top 63.3% on SourcePulse
Thyme is a multimodal large language model designed to go beyond traditional "thinking with images" by autonomously generating and executing image processing operations through code. It targets researchers and developers working with high-resolution perception and complex reasoning tasks, offering enhanced performance and sophisticated reasoning capabilities through a novel two-stage training strategy.
How It Works
Thyme is trained in two stages: supervised fine-tuning (SFT) followed by reinforcement learning (RL). The SFT stage teaches the model to generate executable code for image manipulation. The RL stage, which uses the GRPO-ATS algorithm, refines the model's ability to explore reasoning paths and execute code precisely, balancing exploration with accuracy. This approach allows Thyme to handle complex, multi-step image-based tasks that require both understanding and manipulation of visual data.
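To make the RL stage more concrete, here is a minimal sketch of the group-relative advantage step used by standard GRPO, which GRPO-ATS builds on. It does not cover the ATS part, which this summary does not describe. The function name, the example rewards, and the choice of population standard deviation are assumptions for illustration, not Thyme's actual implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses sampled for the same prompt.

    Standard GRPO scores each response relative to its group: responses above
    the group mean get a positive advantage, those below get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled responses (1.0 = correct answer).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

When every response in a group receives the same reward, all advantages are zero and that group contributes no learning signal, which is one reason sampling diversity matters in this family of methods.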
Quick Start & Requirements
Dependencies listed include sglang, vllm, transformers, trl, lmdeploy, autoawq, optimum, bitsandbytes, deepspeed, flash-attn, and ms-swift.
Highlighted Details
Maintenance & Community
The project is associated with Kwai and lists numerous authors, suggesting a well-supported research effort. Related projects like "Kwai Keye-VL" and "MM-RLHF" are also mentioned, indicating an active research group.
Licensing & Compatibility
The repository does not explicitly state a license in the provided README. Users should verify licensing terms for commercial use or integration into closed-source projects.
Limitations & Caveats
The setup involves a substantial list of specific dependencies, which may require careful environment management. The README implies that significant computational resources are needed for training, particularly for the RL stage. Image path conversion and system integration need meticulous attention during data preparation.
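As an illustration of the path-conversion caveat, the sketch below rewrites relative image paths in dataset records to absolute paths under a chosen root directory. The record layout, the `image_path` key, and the helper name are hypothetical; the README's actual data format is not shown here and may differ.

```python
from pathlib import Path


def absolutize_image_paths(records, image_root, key="image_path"):
    """Return copies of the records with relative image paths resolved under image_root.

    Paths that are already absolute are left unchanged.
    """
    root = Path(image_root)
    converted = []
    for record in records:
        path = Path(record[key])
        updated = dict(record)  # copy so the input records are not mutated
        updated[key] = str(path if path.is_absolute() else root / path)
        converted.append(updated)
    return converted


# Hypothetical records: one relative path and one already-absolute path.
already_absolute = str(Path.cwd() / "c.png")
records = [
    {"image_path": "train/a.jpg", "question": "What is shown?"},
    {"image_path": already_absolute, "question": "Read the sign."},
]
fixed = absolutize_image_paths(records, Path.cwd() / "images")
```

Checking that each resulting path exists on disk before training starts is a cheap way to catch conversion mistakes early.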