CLIP-Chinese by yangjianxin1

CLIP model for Chinese multimodal tasks

Created 2 years ago
417 stars

Top 70.4% on SourcePulse

Project Summary

CLIP-Chinese provides a pre-trained CLIP model specifically for Chinese language, enabling multimodal understanding tasks like image-text retrieval and similarity matching for Chinese users. It addresses the limitation of English-centric CLIP models by offering a ViT+BERT architecture trained on a large Chinese image-text dataset.

How It Works

The project implements a CLIP model with a ViT (Vision Transformer) image encoder and a BERT-based text encoder. It uses a Locked-image Tuning (LiT) strategy, freezing the ViT weights and training only the BERT component on 1.4 million Chinese image-text pairs. The ViT is initialized from OpenAI's CLIP and the text encoder from Mengzi's pre-trained BERT weights, an approach that aims for efficient transfer learning and strong performance on Chinese multimodal tasks.
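A minimal sketch of that freezing step in PyTorch, assuming a transformers-style CLIP model that exposes a vision_model attribute (the attribute name and optimizer settings are assumptions, not taken from the repository):

```python
# Hypothetical LiT-style freezing: lock the image tower so that only the
# BERT text encoder receives gradient updates during contrastive training.
def freeze_image_tower(model):
    """Freeze every vision-encoder parameter; the text encoder stays trainable."""
    for param in model.vision_model.parameters():  # attribute name assumed
        param.requires_grad = False

# Hand only the still-trainable (text-side) parameters to the optimizer:
#   freeze_image_tower(model)
#   optimizer = torch.optim.AdamW(
#       (p for p in model.parameters() if p.requires_grad), lr=5e-5
#   )
```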

Quick Start & Requirements

  • Install: pip install transformers==4.18.0 torch==1.12.0 (versions pinned by the project)
  • Prerequisites: Python 3.8, PyTorch.
  • Usage: Load the pre-trained weights from Hugging Face (YeungNLP/clip-vit-bert-chinese-1M) with BertCLIPModel.from_pretrained and CLIPProcessor.from_pretrained; a loading sketch appears after this list.
  • Resources: Training requires significant computational resources (GPU recommended). Pre-trained models are available on Hugging Face.
  • Data: 1.4 million Chinese image-text pairs are available via the linked WeChat public account.
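A minimal loading sketch based on the usage notes above. The component.model import path is an assumption about the repository layout (BertCLIPModel is defined in the repo, not in transformers); the processor call follows the standard transformers API.

```python
from transformers import CLIPProcessor
from component.model import BertCLIPModel  # repo-local class; import path assumed
from PIL import Image

model_id = "YeungNLP/clip-vit-bert-chinese-1M"
model = BertCLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

# Prepare one image and candidate Chinese captions for scoring.
image = Image.open("example.jpg")  # any local image
texts = ["一只猫", "一只狗"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
```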

Highlighted Details

  • Offers pre-trained weights for the full CLIP model and a standalone BERT encoder.
  • Provides scripts for similarity calculation (image-text, text-text, image-image); a sketch of the image-text case follows this list.
  • Demonstrates performance with example similarity scores.
  • Includes a data downloading script and configurable training parameters.
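Continuing from the loading sketch above, a hedged example of the image-text similarity computation. get_image_features and get_text_features follow the transformers CLIPModel API, which BertCLIPModel is assumed to expose; the repo's own scripts may differ in detail.

```python
import torch

with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
    )

# L2-normalize so the dot product equals cosine similarity.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
similarity = image_emb @ text_emb.T  # shape: (num_images, num_texts)
print(similarity)
```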

Maintenance & Community

  • Developed by yangjianxin1.
  • Pre-trained weights and data are shared via Hugging Face and a WeChat public account.

Licensing & Compatibility

  • The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project's license is not specified, which may impact commercial adoption. The README notes that the image encoder's capabilities are primarily inherited from OpenAI's CLIP due to weight freezing during LiT tuning.

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Jiayi Pan (Author of SWE-Gym; MTS at xAI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 1 more.

METER by zdou0830

373 stars
Multimodal framework for vision-and-language transformer research
Created 3 years ago
Updated 2 years ago
Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Travis Fischer (Founder of Agentic), and 5 more.

fromage by kohjingyu

482 stars
Multimodal model for grounding language models to images
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Wing Lian (Founder of Axolotl AI), and 10 more.

open_flamingo by mlfoundations

4k stars
Open-source framework for training large multimodal models
Created 2 years ago
Updated 1 year ago