Combine Qwen3 and SmolVLM2 for Chinese multimodal understanding
This repository presents a method for "stitching" together existing vision and language models into a single multimodal model, specifically by combining the SmolVLM2 vision encoder with the Qwen3-0.6B language model. It targets users who want to give small language models visual understanding, particularly in Chinese, without extensive architectural changes.
How It Works
The core approach involves replacing SmolVLM2's original language model with Qwen3-0.6B, including its tokenizer and language model head. This "stitching" process requires careful alignment of the vision model's output features to Qwen3's input dimensions via a new connector layer. Crucially, the chat template is adapted to integrate image tokens seamlessly into Qwen3's conversational format, preserving its existing capabilities like function calling.
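The idea can be sketched with stock HuggingFace components. The checkpoint ids, attribute paths, and connector shape below are assumptions for illustration, not the repository's actual code.

```python
import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoModelForImageTextToText, AutoTokenizer

# Illustrative checkpoint names -- adjust to the ones the repo actually uses.
VLM_ID = "HuggingFaceTB/SmolVLM2-256M-Video-Instruct"
LLM_ID = "Qwen/Qwen3-0.6B"

# SmolVLM2 supplies the vision tower; Qwen3 supplies the language model,
# tokenizer, and LM head that replace SmolVLM2's original text side.
smolvlm = AutoModelForImageTextToText.from_pretrained(VLM_ID, torch_dtype=torch.bfloat16)
qwen3 = AutoModelForCausalLM.from_pretrained(LLM_ID, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(LLM_ID)

vision_encoder = smolvlm.model.vision_model  # attribute path may differ by transformers version

# New connector: project vision features into Qwen3's embedding space so
# image patches can be spliced into the text sequence as soft tokens.
vision_dim = vision_encoder.config.hidden_size
text_dim = qwen3.config.hidden_size
connector = nn.Sequential(
    nn.Linear(vision_dim, text_dim),
    nn.GELU(),
    nn.Linear(text_dim, text_dim),
)

def encode_image(pixel_values: torch.Tensor) -> torch.Tensor:
    """Map pixels to embeddings with Qwen3's hidden size."""
    patches = vision_encoder(pixel_values).last_hidden_state  # (B, N, vision_dim)
    return connector(patches)                                 # (B, N, text_dim)
```

The connector is the only newly initialized module; the vision encoder and Qwen3 weights start from their pretrained checkpoints, which is what keeps the approach cheap to train.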
Quick Start & Requirements
pip install -r requirements.txt
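Once dependencies are installed, inference with a stitched checkpoint would follow the usual transformers chat flow. A minimal sketch, assuming a locally trained checkpoint at a hypothetical path (the README does not publish a model id):

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

# Hypothetical local path to the stitched checkpoint -- point this at your own training output.
CKPT = "./output/qwen3-smolvlm2"

processor = AutoProcessor.from_pretrained(CKPT, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    CKPT, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

# Qwen3-style chat message with an interleaved image, rendered through the
# adapted chat template described above.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/cat.jpg"},
        {"type": "text", "text": "请描述这张图片。"},  # "Describe this image."
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```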
Highlighted Details
Maintenance & Community
The project is authored by ShaohonChen, with a collaborator credited for code review and testing. Links to SwanLab training logs are provided.
Licensing & Compatibility
The README does not explicitly state a license. The project uses models from HuggingFace and Qwen, which have their own licenses. Compatibility for commercial use is not specified.
Limitations & Caveats
Training requires significant GPU VRAM (40GB+). The initial fine-tuning uses English datasets, with plans for Chinese data synthesis in future installments. Some sub-datasets within "the_cauldron" may require manual handling. The project focuses on the "stitching" method, with deeper analysis of dataset optimization and advanced fine-tuning techniques planned for subsequent posts.
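As an illustration of that manual handling: the_cauldron is published on HuggingFace as many named sub-configs that must be loaded individually. The subset names and single-image filter below are illustrative, not the project's exact data recipe.

```python
from datasets import concatenate_datasets, load_dataset

# Illustrative sub-dataset names; the_cauldron ships dozens of configs and
# some (very large or oddly formatted ones) may need extra filtering.
SUBSETS = ["ai2d", "chartqa", "docvqa"]

parts = []
for name in SUBSETS:
    ds = load_dataset("HuggingFaceM4/the_cauldron", name, split="train")
    # Keep only single-image samples to simplify the connector's input shape.
    ds = ds.filter(lambda ex: len(ex["images"]) == 1)
    parts.append(ds)

train_ds = concatenate_datasets(parts)
print(train_ds)
```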