clip-pytorch by bubbliiiing

CLIP for transferable visual-language models

Created 3 years ago
252 stars

Top 99.6% on SourcePulse

Summary

This repository provides a PyTorch implementation of CLIP (Contrastive Language–Image Pre-training) for training transferable visual models directly from natural language supervision. It targets researchers and developers who want to adapt CLIP to their own custom datasets, with explicit support for both Chinese and English captions, and offers a practical framework for building and deploying vision-language models for a wide range of downstream applications.

How It Works

The project implements CLIP in PyTorch and provides a framework for training models on user-provided image-caption datasets. Distinct, executable scripts cover the full lifecycle: training (train.py), inference/prediction (predict.py), and evaluation (eval.py). The core method is contrastive learning: image and text embeddings are trained jointly so that visual features align with their corresponding natural-language descriptions. The implementation emphasizes customizability, letting users move beyond standard pre-trained models, and builds on foundational architectures referenced from established works such as OpenAI's CLIP and Alibaba's AliceMind.
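The contrastive objective described above can be sketched in a few lines. The snippet below is a minimal, illustrative NumPy version, not the repository's actual code: the encoders are stubbed with random vectors, and the temperature value 0.07 is the commonly cited CLIP default, assumed here.

```python
# Minimal sketch of CLIP-style contrastive alignment (illustrative only).
# Real encoders are vision/text transformers; here they are stubbed.
import numpy as np

rng = np.random.default_rng(0)
batch, dim = 4, 8

# Stand-ins for encoder outputs on a batch of matched image-caption pairs.
img_emb = rng.normal(size=(batch, dim))
txt_emb = img_emb + 0.01 * rng.normal(size=(batch, dim))  # captions near their images

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

img_emb, txt_emb = l2_normalize(img_emb), l2_normalize(txt_emb)

# Pairwise cosine similarities scaled by a temperature; the diagonal
# holds the matched image-caption pairs.
logits = img_emb @ txt_emb.T / 0.07

def cross_entropy(logits, targets):
    # Row-wise softmax cross-entropy against integer targets.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

targets = np.arange(batch)
# Symmetric loss: images -> texts and texts -> images.
loss = 0.5 * (cross_entropy(logits, targets) + cross_entropy(logits.T, targets))
```

Minimizing this symmetric loss pulls each image toward its own captions and pushes it away from the rest of the batch, which is the alignment mechanism the training script optimizes.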

Quick Start & Requirements

  • Primary workflow: train with train.py, run inference with predict.py, and evaluate with eval.py.
  • Prerequisites: PyTorch 1.7.1 or newer.
  • Dependencies: pre-trained weights and example datasets (such as Flickr8k) must be downloaded via the provided Baidu Netdisk links.
  • Links:
    • Pre-trained weights: https://pan.baidu.com/s/1b9Nt-UuqOJfhbhJYVyrK0g (Code: mfnc)
    • Flickr8k dataset: https://pan.baidu.com/s/1UzaGmbEGz1BXZ0IXK1TT7g (Code: exg3)
    • OpenAI CLIP: https://github.com/openai/CLIP
    • AliceMind: https://github.com/alibaba/AliceMind
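The three-script workflow can be sketched as a command sequence. Script names come from the README; any configuration (dataset paths, model choice) lives inside the scripts themselves and is not shown here.

```shell
# Hypothetical end-to-end workflow for a custom dataset (sketch only):
# 1. Download pre-trained weights and a dataset via the Baidu Netdisk links above.
# 2. Train on your image-caption pairs.
python train.py
# 3. Run inference/prediction.
python predict.py
# 4. Evaluate retrieval performance.
python eval.py
```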

Highlighted Details

  • Explicitly supports the training of CLIP models on custom datasets, accommodating both Chinese and English language inputs.
  • Delivers a complete suite of scripts covering the entire workflow: model training, prediction/inference, and comprehensive evaluation.
  • Provides clear instructions and a defined JSON data format for preparing custom datasets, including image paths and multiple caption options.
  • Draws architectural inspiration and context from the foundational OpenAI CLIP and Alibaba AliceMind projects.
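The summary mentions a JSON data format with image paths and multiple captions per image, but does not reproduce the schema. As a hedged sketch, an entry might be built and sanity-checked as below; the field names `image` and `caption` and the example path are assumptions, and the repository's README defines the authoritative format.

```python
# Hypothetical dataset entry; consult the repo's README for the real schema.
import json

entry = {
    "image": "datasets/flickr8k-images/example.jpg",  # image path (field name assumed)
    "caption": [                                      # multiple captions per image
        "A dog runs across the grass.",
        "A brown dog is playing outside.",
    ],
}

def validate(entry):
    # Minimal sanity checks: one image path, at least one caption string.
    assert isinstance(entry["image"], str) and entry["image"]
    assert isinstance(entry["caption"], list) and entry["caption"]
    assert all(isinstance(c, str) for c in entry["caption"])

validate(entry)
serialized = json.dumps([entry], ensure_ascii=False, indent=2)
```

Validating entries before training catches malformed paths or empty caption lists early, instead of mid-epoch in the data loader.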

Maintenance & Community

The provided README does not contain specific information regarding project maintainers, community support channels (such as Discord or Slack), or a public roadmap, limiting visibility into project health and future development.

Licensing & Compatibility

The README does not state a software license. This prevents any assessment of the project's suitability for commercial applications or integration into proprietary, closed-source software.

Limitations & Caveats

  • Reliance on Baidu Netdisk for critical assets (pre-trained weights, datasets) may pose accessibility or long-term availability concerns for users outside certain regions or behind network restrictions.
  • The absence of a defined software license creates significant ambiguity about usage rights, particularly for commercial or collaborative development.
  • Non-English datasets such as Chinese require manual code changes (e.g., modifying the phi parameter), so multilingual use demands some familiarity with the codebase.
Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 30 days

Explore Similar Projects

Starred by Jiayi Pan (Author of SWE-Gym; MTS at xAI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 1 more.

METER by zdou0830 (375 stars)
Multimodal framework for vision-and-language transformer research
Created 4 years ago, updated 3 years ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Yaowei Zheng (Author of LLaMA-Factory), and 1 more.

CLIP_prefix_caption by rmokady (1k stars)
Image captioning model using CLIP embeddings as a prefix
Created 4 years ago, updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Wing Lian (Founder of Axolotl AI), and 10 more.

open_flamingo by mlfoundations (4k stars)
Open-source framework for training large multimodal models
Created 3 years ago, updated 1 year ago