CLIP model for Chinese multimodal tasks
CLIP-Chinese provides a CLIP model pre-trained specifically for the Chinese language, enabling multimodal understanding tasks such as image-text retrieval and similarity matching for Chinese users. It addresses the English-centric limitation of the original CLIP models by offering a ViT+BERT architecture trained on a large Chinese image-text dataset.
How It Works
The project implements a CLIP model with a ViT (Vision Transformer) image encoder and a BERT-based text encoder. It uses a Locked-image text Tuning (LiT) strategy: the ViT weights are frozen and only the BERT component is trained, on 1.4 million Chinese image-text pairs. The ViT is initialized from OpenAI's CLIP and the text encoder from Mengzi's pre-trained BERT weights, aiming for efficient transfer learning and strong performance on Chinese multimodal tasks.
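A minimal sketch of the LiT idea, shown here with the stock Hugging Face CLIPModel rather than the project's BertCLIPModel (whose attribute names may differ): the vision tower is frozen so only the text side receives gradient updates.

```python
from transformers import CLIPModel

# Illustration only: the project initializes its ViT from OpenAI's CLIP,
# then locks it and trains the BERT text encoder on Chinese image-text pairs.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# Lock the image encoder (the "Locked-image" part of LiT).
for param in model.vision_model.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{trainable} trainable parameters remain (text encoder + projections)")
```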
Quick Start & Requirements
Install the dependencies with pip install transformers torch (specific versions: transformers==4.18.0, torch==1.12.0). Load the pre-trained model (YeungNLP/clip-vit-bert-chinese-1M) using BertCLIPModel.from_pretrained and CLIPProcessor.from_pretrained.
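The sketch below shows one way to wire these calls together for image-text similarity; the BertCLIPModel import path and the logits_per_image output attribute are assumptions based on the standard Hugging Face CLIP interface and may differ in the actual repository.

```python
import torch
from PIL import Image
from transformers import CLIPProcessor

# BertCLIPModel ships with the CLIP-Chinese repository; this import path
# is an assumption and may need adjusting to match your checkout.
from component.model import BertCLIPModel

model_name = "YeungNLP/clip-vit-bert-chinese-1M"
model = BertCLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

image = Image.open("example.jpg")          # any local image
texts = ["一只猫", "一只狗", "一辆汽车"]    # candidate Chinese captions

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Assumes the output follows the Hugging Face CLIPModel convention of
# exposing image-text similarity logits as `logits_per_image`.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```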
Highlighted Details
Maintenance & Community
The repository was last updated about 2 years ago and is marked inactive.
Licensing & Compatibility
The project's license is not specified, which may impact commercial adoption.
Limitations & Caveats
The README notes that the image encoder's capabilities are primarily inherited from OpenAI's CLIP, since the ViT weights are frozen during LiT tuning.