Qwen3-SmVL by ShaohonChen

Combine Qwen3 and SmolVLM2 for Chinese multimodal understanding

created 1 month ago
252 stars

Top 99.6% on SourcePulse

Project Summary

This repository presents a method for "stitching" together existing vision and language models to create a multimodal capability, specifically by combining the SmolVLM2 vision encoder with the Qwen3-0.6B language model. It targets users who want to imbue small language models with visual understanding, particularly in Chinese, without extensive architectural changes.

How It Works

The core approach involves replacing SmolVLM2's original language model with Qwen3-0.6B, including its tokenizer and language model head. This "stitching" process requires careful alignment of the vision model's output features to Qwen3's input dimensions via a new connector layer. Crucially, the chat template is adapted to integrate image tokens seamlessly into Qwen3's conversational format, preserving its existing capabilities like function calling.
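The connector described above can be sketched as a small projection module. The code below is illustrative only: the dimensions (a 1152-dim vision-encoder output, a 1024-dim Qwen3-0.6B hidden size) and the single-linear-layer design are assumptions, not values taken from the repository.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Projects vision-encoder features into the language model's embedding space."""
    def __init__(self, vision_dim: int, lm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, lm_dim)

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_image_tokens, vision_dim) -> (batch, num_image_tokens, lm_dim)
        return self.proj(vision_features)

# Illustrative dimensions only (assumed, not from the repo).
connector = VisionLanguageConnector(vision_dim=1152, lm_dim=1024)
features = torch.randn(1, 64, 1152)   # fake encoder output for one image
projected = connector(features)
print(projected.shape)                # torch.Size([1, 64, 1024])
```

Once projected, the image features occupy the same embedding space as Qwen3's text tokens, so they can be spliced into the input sequence at the positions the chat template reserves for the image.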

Quick Start & Requirements

  • Installation: Clone the GitHub repository and install requirements using pip install -r requirements.txt.
  • Hardware: Requires GPUs with at least 40GB VRAM for training. The project details successful implementation on Metax C500 GPUs.
  • Dependencies: PyTorch (>=2.6.0), torchvision, transformers (>=4.53.0), accelerate, datasets, num2words.
  • Data: The "the_cauldron" dataset is used for fine-tuning.
  • Links: GitHub Repository, SwanLab Overview
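Given the 40GB VRAM requirement above, a quick pre-flight check can save a failed run. This is a sketch using PyTorch's CUDA API; the 40GB threshold comes from the README, everything else is illustrative.

```python
import torch

def has_enough_vram(min_gb: float = 40.0, device: int = 0) -> bool:
    """Return True if the given CUDA device has at least `min_gb` of VRAM."""
    if not torch.cuda.is_available():
        return False
    total_bytes = torch.cuda.get_device_properties(device).total_memory
    return total_bytes / 1024**3 >= min_gb

# On a machine without a >=40 GB GPU this simply reports False.
print(has_enough_vram())
```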

Highlighted Details

  • Demonstrates a novel "stitching" technique for VLM creation.
  • Adapts SmolVLM2's architecture to Qwen3-0.6B, enabling Chinese multimodal understanding.
  • Successfully integrates image understanding while preserving Qwen3's original capabilities.
  • Provides detailed code explanations for model replacement, connector adaptation, and chat template modification.
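The chat template modification mentioned above amounts to expanding an image placeholder inside Qwen3's ChatML-style turns. The sketch below assumes a placeholder token named `<image>` and a 64-slot expansion; both names and counts are illustrative, not taken from the repository.

```python
IM_START, IM_END = "<|im_start|>", "<|im_end|>"
IMAGE_TOKEN = "<image>"      # placeholder token name is an assumption
NUM_IMAGE_TOKENS = 64        # one slot per projected vision feature (assumed)

def build_prompt(user_text: str, has_image: bool) -> str:
    """Format one user turn in a ChatML-style template, expanding the image
    placeholder into the number of slots the connector produces."""
    image_part = IMAGE_TOKEN * NUM_IMAGE_TOKENS + "\n" if has_image else ""
    return (
        f"{IM_START}user\n{image_part}{user_text}{IM_END}\n"
        f"{IM_START}assistant\n"
    )

prompt = build_prompt("Describe this image", has_image=True)
print(prompt.count(IMAGE_TOKEN))  # 64
```

Because only the template changes and the text-side vocabulary stays intact, Qwen3's existing behaviors (such as function calling) keep working alongside the new image slots.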

Maintenance & Community

The project is authored by ShaohonChen, with a collaborator credited for code review and testing. Links to SwanLab training logs are provided.

Licensing & Compatibility

The README does not explicitly state a license. The project uses models from HuggingFace and Qwen, which have their own licenses. Compatibility for commercial use is not specified.

Limitations & Caveats

Training requires significant GPU VRAM (40GB+). The initial fine-tuning uses English datasets, with plans for Chinese data synthesis in future installments. Some sub-datasets within "the_cauldron" may require manual handling. The project focuses on the "stitching" method, with deeper analysis of dataset optimization and advanced fine-tuning techniques planned for subsequent posts.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 10
  • Star History: 243 stars in the last 30 days
