CLIP model for Chinese multimodal tasks
CLIP-Chinese provides a CLIP model pre-trained specifically for the Chinese language, enabling multimodal understanding tasks such as image-text retrieval and similarity matching for Chinese users. It addresses the English-centric limitation of the original CLIP models by offering a ViT+BERT architecture trained on a large Chinese image-text dataset.
How It Works
The project implements a CLIP model with a ViT (Vision Transformer) image encoder and a BERT-based text encoder. It uses a Locked-image text Tuning (LiT) strategy: the ViT weights are frozen and only the BERT component is trained, on 1.4 million Chinese image-text pairs. The ViT is initialized from OpenAI's CLIP and the text encoder from Mengzi's pre-trained BERT weights, aiming for efficient transfer learning and strong performance on Chinese multimodal tasks.
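A minimal sketch of the LiT idea, shown here with the stock Hugging Face CLIPModel rather than the project's BertCLIPModel (whose attribute names may differ): the vision tower is frozen so only the text side receives gradient updates.

```python
from transformers import CLIPModel

# Illustration only: the project initializes its ViT from OpenAI's CLIP,
# then locks it and trains the BERT text encoder on Chinese image-text pairs.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# Lock the image encoder (the "Locked-image" part of LiT).
for param in model.vision_model.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{trainable} trainable parameters remain (text encoder + projections)")
```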
Quick Start & Requirements
Install the dependencies with pip install transformers torch (specific versions: transformers==4.18.0, torch==1.12.0). Load the pre-trained model (YeungNLP/clip-vit-bert-chinese-1M) using BertCLIPModel.from_pretrained and CLIPProcessor.from_pretrained.
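The sketch below shows one way to wire these calls together for image-text similarity; the BertCLIPModel import path and the logits_per_image output attribute are assumptions based on the standard Hugging Face CLIP interface and may differ in the actual repository.

```python
import torch
from PIL import Image
from transformers import CLIPProcessor

# BertCLIPModel ships with the CLIP-Chinese repository; this import path
# is an assumption and may need adjusting to match your checkout.
from component.model import BertCLIPModel

model_name = "YeungNLP/clip-vit-bert-chinese-1M"
model = BertCLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

image = Image.open("example.jpg")          # any local image
texts = ["一只猫", "一只狗", "一辆汽车"]    # candidate Chinese captions

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Assumes the output follows the Hugging Face CLIPModel convention of
# exposing image-text similarity logits as `logits_per_image`.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```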
Highlighted Details
Maintenance & Community
The repository was last updated about 2 years ago and is marked inactive.
Licensing & Compatibility
The project's license is not specified, which may impact commercial adoption.
Limitations & Caveats
The README notes that the image encoder's capabilities are primarily inherited from OpenAI's CLIP, since the ViT weights are frozen during LiT tuning.