Combine Qwen3 and SmolVLM2 for Chinese multimodal understanding
This repository presents a method for "stitching" together existing vision and language models into a single multimodal model, specifically by combining the SmolVLM2 vision encoder with the Qwen3-0.6B language model. It targets users who want to give small language models visual understanding, particularly in Chinese, without extensive architectural changes.
How It Works
The core approach involves replacing SmolVLM2's original language model with Qwen3-0.6B, including its tokenizer and language model head. This "stitching" process requires careful alignment of the vision model's output features to Qwen3's input dimensions via a new connector layer. Crucially, the chat template is adapted to integrate image tokens seamlessly into Qwen3's conversational format, preserving its existing capabilities like function calling.
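The idea can be sketched with stock HuggingFace components. The checkpoint ids, attribute paths, and connector shape below are assumptions for illustration, not the repository's actual code.

```python
import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoModelForImageTextToText, AutoTokenizer

# Illustrative checkpoint names -- adjust to the ones the repo actually uses.
VLM_ID = "HuggingFaceTB/SmolVLM2-256M-Video-Instruct"
LLM_ID = "Qwen/Qwen3-0.6B"

# SmolVLM2 supplies the vision tower; Qwen3 supplies the language model,
# tokenizer, and LM head that replace SmolVLM2's original text side.
smolvlm = AutoModelForImageTextToText.from_pretrained(VLM_ID, torch_dtype=torch.bfloat16)
qwen3 = AutoModelForCausalLM.from_pretrained(LLM_ID, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(LLM_ID)

vision_encoder = smolvlm.model.vision_model  # attribute path may differ by transformers version

# New connector: project vision features into Qwen3's embedding space so
# image patches can be spliced into the text sequence as soft tokens.
vision_dim = vision_encoder.config.hidden_size
text_dim = qwen3.config.hidden_size
connector = nn.Sequential(
    nn.Linear(vision_dim, text_dim),
    nn.GELU(),
    nn.Linear(text_dim, text_dim),
)

def encode_image(pixel_values: torch.Tensor) -> torch.Tensor:
    """Map pixels to embeddings with Qwen3's hidden size."""
    patches = vision_encoder(pixel_values).last_hidden_state  # (B, N, vision_dim)
    return connector(patches)                                 # (B, N, text_dim)
```

The connector is the only newly initialized module; the vision encoder and Qwen3 weights start from their pretrained checkpoints, which is what keeps the approach cheap to train.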
Quick Start & Requirements
pip install -r requirements.txt
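Once dependencies are installed, inference with a stitched checkpoint would follow the usual transformers chat flow. A minimal sketch, assuming a locally trained checkpoint at a hypothetical path (the README does not publish a model id):

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

# Hypothetical local path to the stitched checkpoint -- point this at your own training output.
CKPT = "./output/qwen3-smolvlm2"

processor = AutoProcessor.from_pretrained(CKPT, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    CKPT, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

# Qwen3-style chat message with an interleaved image, rendered through the
# adapted chat template described above.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/cat.jpg"},
        {"type": "text", "text": "请描述这张图片。"},  # "Describe this image."
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```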
Highlighted Details
Maintenance & Community
The project is authored by ShaohonChen, with a collaborator credited for code review and testing. Links to SwanLab training logs are provided.
Licensing & Compatibility
The README does not explicitly state a license. The project uses models from HuggingFace and Qwen, which have their own licenses. Compatibility for commercial use is not specified.
Limitations & Caveats
Training requires significant GPU VRAM (40GB+). The initial fine-tuning uses English datasets, with plans for Chinese data synthesis in future installments. Some sub-datasets within "the_cauldron" may require manual handling. The project focuses on the "stitching" method, with deeper analysis of dataset optimization and advanced fine-tuning techniques planned for subsequent posts.
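As an illustration of that manual handling: the_cauldron is published on HuggingFace as many named sub-configs that must be loaded individually. The subset names and single-image filter below are illustrative, not the project's exact data recipe.

```python
from datasets import concatenate_datasets, load_dataset

# Illustrative sub-dataset names; the_cauldron ships dozens of configs and
# some (very large or oddly formatted ones) may need extra filtering.
SUBSETS = ["ai2d", "chartqa", "docvqa"]

parts = []
for name in SUBSETS:
    ds = load_dataset("HuggingFaceM4/the_cauldron", name, split="train")
    # Keep only single-image samples to simplify the connector's input shape.
    ds = ds.filter(lambda ex: len(ex["images"]) == 1)
    parts.append(ds)

train_ds = concatenate_datasets(parts)
print(train_ds)
```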