Multilingual-CLIP by FreddeFrallan

Multilingual text encoders leveraging OpenAI's CLIP model

Created 4 years ago
811 stars

Top 43.6% on SourcePulse

View on GitHub
Project Summary

This repository provides pre-trained CLIP text encoders for multiple languages, enabling cross-lingual and multilingual text-image retrieval. It is designed for researchers and developers working with multimodal AI who need to bridge language barriers in visual understanding tasks. Its key advantage is that users can apply CLIP's image-text matching across a wide array of languages without collecting extensive multilingual image-text datasets.

How It Works

The project adapts OpenAI's CLIP architecture by retraining its text encoder using a cross-lingual teacher learning approach. This method utilizes machine translation to align text from various languages with English CLIP embeddings, effectively transferring CLIP's visual-textual understanding to new languages. This approach bypasses the need for target-language image-text pairs, making it efficient and scalable for low-resource languages.
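
A minimal sketch of one such teacher-learning step is shown below. The repository's actual training code is in TensorFlow; the PyTorch framing, model names, mean-pooling, and MSE objective here are illustrative assumptions, not the repository's exact method:

```python
# Illustrative cross-lingual teacher learning: a multilingual student is
# trained to reproduce frozen English CLIP text embeddings for
# machine-translated captions. Details are assumptions, not the repo's code.
import torch
import torch.nn as nn
import clip  # OpenAI CLIP, used as the frozen teacher
from transformers import AutoModel, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
teacher, _ = clip.load("ViT-B/32", device=device)  # 512-dim text embeddings

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
student = AutoModel.from_pretrained("xlm-roberta-large").to(device)
# Linear head mapping the student's hidden size into CLIP's embedding space.
projection = nn.Linear(student.config.hidden_size, 512).to(device)

params = list(student.parameters()) + list(projection.parameters())
optimizer = torch.optim.Adam(params, lr=1e-5)
loss_fn = nn.MSELoss()

def train_step(english_texts, translated_texts):
    # Teacher target: frozen CLIP embeddings of the English captions.
    with torch.no_grad():
        tokens = clip.tokenize(english_texts).to(device)
        target = teacher.encode_text(tokens).float()
    # Student: embed the machine-translated captions, mean-pool, project.
    batch = tokenizer(translated_texts, padding=True, truncation=True,
                      return_tensors="pt").to(device)
    hidden = student(**batch).last_hidden_state
    mask = batch.attention_mask.unsqueeze(-1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)
    loss = loss_fn(projection(pooled), target)  # pull student toward teacher
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only paired translations are needed as supervision, any parallel corpus or machine-translation output can serve as training data, with no target-language images involved.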

Quick Start & Requirements

  • Install: pip install multilingual-clip torch (or pip install multilingual-clip tensorflow); a usage sketch follows this list.
  • Requirements: Python >= 3.6.9, Transformers == 4.8.1, and either PyTorch or TensorFlow.
  • Demo: Live Demo
  • Colab Notebook: Multilingual_CLIP.ipynb
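
For the Huggingface-hosted checkpoints, PyTorch inference looks roughly like this (a sketch following the repository's documented API; the checkpoint name is one of the published M-CLIP models):

```python
from multilingual_clip import pt_multilingual_clip
import transformers

texts = [
    "Three blind horses listening to Mozart.",
    "Två hundar som leker i snön.",  # Swedish: "Two dogs playing in the snow."
]
model_name = "M-CLIP/XLM-Roberta-Large-Vit-L-14"

# Load the multilingual text encoder and its tokenizer.
model = pt_multilingual_clip.MultilingualCLIP.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

# Embeddings live in the same space as the paired CLIP vision model (ViT-L/14).
embeddings = model.forward(texts, tokenizer)
print(embeddings.shape)  # e.g. torch.Size([2, 768])
```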

Highlighted Details

  • Offers multiple pre-trained models based on different transformer architectures (e.g., XLM-R, LaBSE) and vision models (e.g., ViT-L/14, ViT-B/32).
  • Provides benchmark results on MS-COCO for various languages, showing competitive performance.
  • Includes code for both PyTorch and TensorFlow inference, as well as TensorFlow training; a retrieval sketch follows this list.
  • Supports a wide range of languages, with specific models fine-tuned for up to 109 languages.
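
Pairing a text encoder with its matching CLIP vision model gives cross-lingual retrieval. A hypothetical end-to-end sketch, where the image path and the Swedish query are placeholders:

```python
import torch
import clip  # OpenAI CLIP supplies the matching vision model
import transformers
from PIL import Image
from multilingual_clip import pt_multilingual_clip

device = "cuda" if torch.cuda.is_available() else "cpu"
vision_model, preprocess = clip.load("ViT-L/14", device=device)

model_name = "M-CLIP/XLM-Roberta-Large-Vit-L-14"
text_model = pt_multilingual_clip.MultilingualCLIP.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

image = preprocess(Image.open("dog.jpg")).unsqueeze(0).to(device)  # placeholder file
with torch.no_grad():
    image_emb = vision_model.encode_image(image).float()
    text_emb = text_model.forward(["En hund som leker i snön"], tokenizer).to(device)

# Cosine similarity between the image and the non-English query.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print((image_emb @ text_emb.T).item())
```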

Maintenance & Community

  • Contact: Fredrik.Carlsson@ri.se for questions.
  • Contributions are welcomed for new language models.
  • Acknowledgements include Stability.ai for compute resources.

Licensing & Compatibility

  • License: MIT License.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

Legacy models require a separate download of their linear weights and are less integrated than the newer Huggingface-hosted models. Performance can vary across languages, particularly for those with less representation in the training data.

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 30 days

Explore Similar Projects

METER by zdou0830
Multimodal framework for vision-and-language transformer research. 373 stars. Created 3 years ago, updated 2 years ago. Starred by Jiayi Pan (Author of SWE-Gym; MTS at xAI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 1 more.

open_flamingo by mlfoundations
Open-source framework for training large multimodal models. 4k stars. Created 2 years ago, updated 1 year ago. Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Wing Lian (Founder of Axolotl AI), and 10 more.