Thyme by yfzhang114

Multimodal reasoning and code execution for complex visual tasks

Created 1 month ago
486 stars

Top 63.3% on SourcePulse

Project Summary

Thyme is a multimodal large language model designed to go beyond traditional "thinking with images" by autonomously generating and executing image processing operations through code. It targets researchers and developers working on high-resolution perception and complex reasoning tasks, and attributes its improved performance on these tasks to a two-stage training strategy (supervised fine-tuning followed by reinforcement learning).
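To make "manipulating images through code" concrete, the snippet below shows the kind of operation such a model might emit and run in a sandbox: cropping a region of interest from a high-resolution image and magnifying it before re-inspecting it. This is an illustrative sketch only; the function name, coordinates, and output path are not taken from the Thyme codebase.

```python
# Illustrative only: a crop-and-zoom operation of the kind a code-executing
# multimodal model might run on a high-resolution input before answering.
from PIL import Image


def crop_and_zoom(image_path: str, box: tuple[int, int, int, int],
                  scale: float = 2.0) -> str:
    """Crop a region of interest and upscale it for closer inspection."""
    img = Image.open(image_path)
    region = img.crop(box)  # box = (left, upper, right, lower) in pixels
    w, h = region.size
    zoomed = region.resize((int(w * scale), int(h * scale)),
                           Image.Resampling.LANCZOS)
    out_path = "region_zoomed.png"  # hypothetical output location
    zoomed.save(out_path)
    return out_path


# Example call (paths and coordinates are made up):
# crop_and_zoom("scene.png", (120, 340, 480, 620))
```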

How It Works

Thyme employs a unique two-stage training process: supervised fine-tuning (SFT) followed by reinforcement learning (RL). The SFT stage focuses on teaching the model to generate executable code for image manipulation. The RL stage, utilizing the GRPO-ATS algorithm, further refines the model's ability to explore reasoning paths and precisely execute code, balancing exploration with accuracy. This approach allows Thyme to handle complex, multi-step image-based tasks that require both understanding and manipulation of visual data.
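As a rough orientation to the GRPO family that GRPO-ATS belongs to, the sketch below computes group-relative advantages over several rollouts of the same prompt: each rollout's reward is normalized against its own group's mean and standard deviation. This is generic GRPO bookkeeping under an assumed reward design (answer correctness plus successful code execution), not code from the Thyme repository, and it omits whatever GRPO-ATS adds on top.

```python
# Minimal sketch of the group-relative advantage used in GRPO-style RL.
import numpy as np


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> np.ndarray:
    """Advantage of each rollout = (reward - group mean) / group std."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)


# Example: 4 rollouts of one prompt; the reward scheme here is assumed.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.5]))
```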

Quick Start & Requirements

  • Installation: Clone the repository and install dependencies using Conda and pip. Key dependencies include sglang, vllm, transformers, trl, lmdeploy, autoawq, optimum, bitsandbytes, deepspeed, flash-attn, and ms-swift.
  • Prerequisites: Python 3.10, Conda environment, and potentially significant GPU resources for training.
  • Data Preparation: Requires downloading datasets from HuggingFace and updating the local file paths stored in the dataset JSON files (see the path-rewriting sketch after this list).
  • Links: Home Page, Technical Report.
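The following is a hedged sketch of the data-preparation step above: rewriting the image paths recorded in a downloaded dataset JSON so they point at a local copy. The JSON layout (records with an "images" list of path strings) and the prefix values are assumptions; check the fields in the actual dataset files before adapting it.

```python
# Sketch: relocate image paths inside a dataset JSON to a local directory.
import json

OLD_PREFIX = "/original/author/path/images"   # placeholder prefix found in the JSON (assumed)
NEW_PREFIX = "/data/thyme/images"             # your local image directory (assumed)


def relocate_paths(json_in: str, json_out: str) -> None:
    with open(json_in, "r", encoding="utf-8") as f:
        records = json.load(f)
    for rec in records:
        # "images" as a list of path strings is an assumed schema.
        rec["images"] = [p.replace(OLD_PREFIX, NEW_PREFIX) for p in rec.get("images", [])]
    with open(json_out, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


# Example call with hypothetical file names:
# relocate_paths("thyme_sft.json", "thyme_sft_local.json")
```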

Highlighted Details

  • Achieves enhanced performance on high-resolution perception and complex reasoning tasks.
  • Utilizes a novel two-stage training strategy (SFT + RL) with the GRPO-ATS algorithm.
  • Supports autonomous generation and execution of diverse image processing operations.
  • Includes evaluation scripts using VLMEvalKit for benchmarking.

Maintenance & Community

The project is associated with Kwai and lists numerous authors, suggesting a well-supported research effort. Related projects like "Kwai Keye-VL" and "MM-RLHF" are also mentioned, indicating an active research group.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing terms for commercial use or integration into closed-source projects.

Limitations & Caveats

The setup involves a substantial list of specific dependencies, which may require careful environment management. The README implies that significant computational resources are needed for training, particularly for the RL stage. Image path conversion and system integration require meticulous attention during data preparation.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 9
  • Star History: 258 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Elvis Saravia (founder of DAIR.AI).

DeepSeek-VL2 by deepseek-ai

Top 0.1% on SourcePulse · 5k stars
MoE vision-language model for multimodal understanding
Created 9 months ago · Updated 6 months ago