Introduction to Chinese-CLIP
Chinese-CLIP is a project designed to extend the capabilities of the CLIP model to the Chinese language domain. Trained on roughly 200 million Chinese image-text pairs, it aims to enable rapid implementation of tasks such as image-text feature similarity calculation, cross-modal retrieval, and zero-shot image classification. Built on the foundation of the open_clip project, Chinese-CLIP is optimized for Chinese data and delivers strong performance in this setting.
The project offers a variety of resources including APIs, training code, and testing code, making it easy for users to dive into their specific applications.
Recent Updates
The Chinese-CLIP project is under active development, with significant updates including:
- PyTorch to CoreML Conversion: On November 30, 2023, a script was added to convert PyTorch models to CoreML format for easier deployment.
- Knowledge Distillation Support: As of September 8, 2023, the project supports ModelScope-based knowledge distillation during fine-tuning.
- PyTorch 2.0 Compatibility: The codebase was adapted to PyTorch 2.0 on May 9, 2023.
- Gradient Accumulation Support: Introduced on March 20, 2023, to simulate larger batch sizes for more memory-efficient training (a minimal sketch of the technique follows this list).
- FlashAttention Support: On February 16, 2023, FlashAttention was added to increase training speed and reduce memory usage.
- Deployment with ONNX and TensorRT: Supported as of January 15, 2023, including pre-trained TensorRT models for faster feature inference.
- FLIP Training Strategy: Added on December 12, 2022, as an optional strategy for fine-tuning.
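Gradient accumulation is a generic PyTorch training pattern rather than anything specific to Chinese-CLIP's scripts; the sketch below only illustrates the idea, with model, loader, optimizer, loss_fn, and accum_steps all standing in as hypothetical placeholders.
import torch

def train_with_accumulation(model, loader, optimizer, loss_fn, accum_steps=4):
    # Accumulate gradients over `accum_steps` mini-batches to simulate an
    # effective batch size of accum_steps * per-step batch size.
    optimizer.zero_grad()
    for step, (images, texts) in enumerate(loader):
        loss = loss_fn(model, images, texts)
        (loss / accum_steps).backward()   # scale so accumulated gradients average
        if (step + 1) % accum_steps == 0:
            optimizer.step()              # update weights once per accumulation window
            optimizer.zero_grad()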
Available Models and Download
Chinese-CLIP has released five model variants, each tailored with different scales and architectures:
Model | Download Link | Parameters | Vision Backbone | Vision Parameters | Text Backbone | Text Parameters | Resolution (px)
---|---|---|---|---|---|---|---
CN-CLIP RN50 | Download | 77M | ResNet50 | 38M | RBT3 | 39M | 224
CN-CLIP ViT-B/16 | Download | 188M | ViT-B/16 | 86M | RoBERTa-wwm-Base | 102M | 224
CN-CLIP ViT-L/14 | Download | 406M | ViT-L/14 | 304M | RoBERTa-wwm-Base | 102M | 224
CN-CLIP ViT-L/14@336px | Download | 407M | ViT-L/14 | 304M | RoBERTa-wwm-Base | 102M | 336
CN-CLIP ViT-H/14 | Download | 958M | ViT-H/14 | 632M | RoBERTa-wwm-Large | 326M | 224
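These variants correspond to the model names exposed by the cn_clip API (listed again in the quick-start section below), so a specific variant can be selected simply by passing its name. The snippet below assumes the cn_clip package is already installed.
import torch
from cn_clip.clip import load_from_name

device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the ViT-H/14 variant from the table above; the pretrained weights are
# downloaded to download_root on first use.
model, preprocess = load_from_name("ViT-H-14", device=device, download_root="./")
model.eval()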
Experiment Results
Chinese-CLIP has been benchmarked on several tasks, demonstrating strong performance in experiments such as:
- MUGE Text-to-Image Retrieval: Shows significant improvements over baseline models in both zero-shot and fine-tuned settings.
- Flickr30K-CN Retrieval: Performs strongly on both text-to-image and image-to-text retrieval.
- COCO-CN Retrieval: Achieves high retrieval metrics in both zero-shot and fine-tuned setups.
- Zero-shot Image Classification: Performs well across a variety of datasets under the ELEVATER benchmark.
Getting Started
Installation Requirements
Before starting with Chinese-CLIP, ensure that your environment meets these requirements:
- Python >= 3.6.4
- PyTorch >= 1.8.0 (with torchvision >= 0.9.0)
- CUDA Version >= 10.2
Install the necessary third-party libraries with the following command:
pip install -r requirements.txt
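To quickly verify that the local environment meets these requirements, a short check such as the following can be run (a minimal sketch; the thresholds in the comments mirror the list above):
import torch
import torchvision

print("PyTorch:", torch.__version__)            # expected >= 1.8.0
print("torchvision:", torchvision.__version__)  # expected >= 0.9.0
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)      # expected >= 10.2 for GPU use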
Quick Start with API
To utilize the Chinese-CLIP API, follow this straightforward code example:
First, install the cn_clip package:
# Install via pip
pip install cn_clip
# Or install from source
cd Chinese-CLIP
pip install -e .
Once installed, you can easily call the API to extract feature vectors from images and calculate similarities to input texts:
import torch
from PIL import Image
import cn_clip.clip as clip
from cn_clip.clip import load_from_name, available_models
print("Available models:", available_models())
# Available models: ['ViT-B-16', 'ViT-L-14', 'ViT-L-14-336', 'ViT-H-14', 'RN50']
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = load_from_name("ViT-B-16", device=device, download_root='./')
model.eval()
image = preprocess(Image.open("examples/pokemon.jpeg")).unsqueeze(0).to(device)
text = clip.tokenize(["杰尼龟"]).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Calculate the cosine similarity between the image and text features
    similarity = torch.cosine_similarity(image_features, text_features)
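Building on the snippet above (reusing the loaded model, preprocess, image, and device), the same encoders can perform a simple zero-shot classification over several candidate captions. The candidate labels below are illustrative, and the fixed scaling factor of 100 stands in for the model's learned logit scale.
# Candidate Chinese captions (Pokémon names) used as zero-shot class labels
candidates = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]
text = clip.tokenize(candidates).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize so that dot products equal cosine similarities
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    # Softmax over scaled similarities yields one probability per candidate
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for label, prob in zip(candidates, probs[0].tolist()):
    print(f"{label}: {prob:.3f}")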
This guide provides a comprehensive introduction to the Chinese-CLIP project, from its inception and latest updates to practical steps for implementation.