Thyme by yfzhang114

Multimodal reasoning and code execution for complex visual tasks

Created 1 month ago
486 stars

Top 63.3% on SourcePulse

Project Summary

Thyme is a multimodal large language model designed to go beyond traditional "thinking with images" by autonomously generating and executing image processing operations through code. It targets researchers and developers working on high-resolution perception and complex reasoning tasks, and attributes its improved performance on these tasks to a two-stage training strategy (supervised fine-tuning followed by reinforcement learning).
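To make "manipulating images through code" concrete, the snippet below shows the kind of operation such a model might emit and run in a sandbox: cropping a region of interest from a high-resolution image and magnifying it before re-inspecting it. This is an illustrative sketch only; the function name, coordinates, and output path are not taken from the Thyme codebase.

```python
# Illustrative only: a crop-and-zoom operation of the kind a code-executing
# multimodal model might run on a high-resolution input before answering.
from PIL import Image


def crop_and_zoom(image_path: str, box: tuple[int, int, int, int],
                  scale: float = 2.0) -> str:
    """Crop a region of interest and upscale it for closer inspection."""
    img = Image.open(image_path)
    region = img.crop(box)  # box = (left, upper, right, lower) in pixels
    w, h = region.size
    zoomed = region.resize((int(w * scale), int(h * scale)),
                           Image.Resampling.LANCZOS)
    out_path = "region_zoomed.png"  # hypothetical output location
    zoomed.save(out_path)
    return out_path


# Example call (paths and coordinates are made up):
# crop_and_zoom("scene.png", (120, 340, 480, 620))
```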

How It Works

Thyme employs a unique two-stage training process: supervised fine-tuning (SFT) followed by reinforcement learning (RL). The SFT stage focuses on teaching the model to generate executable code for image manipulation. The RL stage, utilizing the GRPO-ATS algorithm, further refines the model's ability to explore reasoning paths and precisely execute code, balancing exploration with accuracy. This approach allows Thyme to handle complex, multi-step image-based tasks that require both understanding and manipulation of visual data.
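As a rough orientation to the GRPO family that GRPO-ATS belongs to, the sketch below computes group-relative advantages over several rollouts of the same prompt: each rollout's reward is normalized against its own group's mean and standard deviation. This is generic GRPO bookkeeping under an assumed reward design (answer correctness plus successful code execution), not code from the Thyme repository, and it omits whatever GRPO-ATS adds on top.

```python
# Minimal sketch of the group-relative advantage used in GRPO-style RL.
import numpy as np


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> np.ndarray:
    """Advantage of each rollout = (reward - group mean) / group std."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)


# Example: 4 rollouts of one prompt; the reward scheme here is assumed.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.5]))
```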

Quick Start & Requirements

  • Installation: Clone the repository and install dependencies using Conda and pip. Key dependencies include sglang, vllm, transformers, trl, lmdeploy, autoawq, optimum, bitsandbytes, deepspeed, flash-attn, and ms-swift.
  • Prerequisites: Python 3.10, Conda environment, and potentially significant GPU resources for training.
  • Data Preparation: Requires downloading datasets from HuggingFace and updating the local file paths stored in the dataset JSON files (see the path-rewriting sketch after this list).
  • Links: Home Page, Technical Report.
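The following is a hedged sketch of the data-preparation step above: rewriting the image paths recorded in a downloaded dataset JSON so they point at a local copy. The JSON layout (records with an "images" list of path strings) and the prefix values are assumptions; check the fields in the actual dataset files before adapting it.

```python
# Sketch: relocate image paths inside a dataset JSON to a local directory.
import json

OLD_PREFIX = "/original/author/path/images"   # placeholder prefix found in the JSON (assumed)
NEW_PREFIX = "/data/thyme/images"             # your local image directory (assumed)


def relocate_paths(json_in: str, json_out: str) -> None:
    with open(json_in, "r", encoding="utf-8") as f:
        records = json.load(f)
    for rec in records:
        # "images" as a list of path strings is an assumed schema.
        rec["images"] = [p.replace(OLD_PREFIX, NEW_PREFIX) for p in rec.get("images", [])]
    with open(json_out, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


# Example call with hypothetical file names:
# relocate_paths("thyme_sft.json", "thyme_sft_local.json")
```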

Highlighted Details

  • Achieves enhanced performance on high-resolution perception and complex reasoning tasks.
  • Utilizes a novel two-stage training strategy (SFT + RL) with the GRPO-ATS algorithm.
  • Supports autonomous generation and execution of diverse image processing operations.
  • Includes evaluation scripts using VLMEvalKit for benchmarking.

Maintenance & Community

The project is associated with Kwai and lists numerous authors, suggesting a well-supported research effort. Related projects like "Kwai Keye-VL" and "MM-RLHF" are also mentioned, indicating an active research group.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing terms for commercial use or integration into closed-source projects.

Limitations & Caveats

The setup involves a substantial list of specific dependencies, which may require careful environment management. The README implies that significant computational resources are needed for training, particularly for the RL stage. Image path conversion and system integration require meticulous attention during data preparation.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 9
  • Star History: 258 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Elvis Saravia (founder of DAIR.AI).

DeepSeek-VL2 by deepseek-ai

Top 0.1% on SourcePulse · 5k stars
MoE vision-language model for multimodal understanding
Created 9 months ago · Updated 6 months ago