Introduction to Chinese-CLIP
Chinese-CLIP is a project designed to extend the capabilities of the CLIP model to the Chinese language domain. Trained on roughly 200 million Chinese image-text pairs, it aims to enable rapid implementation of tasks such as image-text feature similarity calculation, cross-modal retrieval, and zero-shot image classification. Built on the foundation of the open_clip project, Chinese-CLIP is optimized for Chinese data and delivers strong performance in this setting.
The project offers a variety of resources including APIs, training code, and testing code, making it easy for users to dive into their specific applications.
Recent Updates
The Chinese-CLIP project is under active development, with significant updates including:
- PyTorch to CoreML Conversion: On November 30, 2023, a script was added to convert PyTorch models to CoreML format for easier deployment.
- Knowledge Distillation Support: As of September 8, 2023, the project supports ModelScope-based knowledge distillation during fine-tuning.
- PyTorch 2.0 Compatibility: The codebase was adapted to PyTorch 2.0 on May 9, 2023.
- Gradient Accumulation Support: Introduced on March 20, 2023, to simulate larger batch sizes for more memory-efficient training (a minimal sketch of the technique follows this list).
- FlashAttention Support: On February 16, 2023, FlashAttention was added to increase training speed and reduce memory usage.
- Deployment with ONNX and TensorRT: Supported as of January 15, 2023, including pre-trained TensorRT models for faster feature inference.
- FLIP Training Strategy: Added on December 12, 2022, as an optional strategy for fine-tuning.
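Gradient accumulation is a generic PyTorch training pattern rather than anything specific to Chinese-CLIP's scripts; the sketch below only illustrates the idea, with model, loader, optimizer, loss_fn, and accum_steps all standing in as hypothetical placeholders.
import torch

def train_with_accumulation(model, loader, optimizer, loss_fn, accum_steps=4):
    # Accumulate gradients over `accum_steps` mini-batches to simulate an
    # effective batch size of accum_steps * per-step batch size.
    optimizer.zero_grad()
    for step, (images, texts) in enumerate(loader):
        loss = loss_fn(model, images, texts)
        (loss / accum_steps).backward()   # scale so accumulated gradients average
        if (step + 1) % accum_steps == 0:
            optimizer.step()              # update weights once per accumulation window
            optimizer.zero_grad()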
Available Models and Download
Chinese-CLIP has released five model variants, each tailored with different scales and architectures:
Model | Download Link | Parameters | Vision Backbone | Vision Parameters | Text Backbone | Text Parameters | Resolution (px)
---|---|---|---|---|---|---|---
CN-CLIP RN50 | Download | 77M | ResNet50 | 38M | RBT3 | 39M | 224
CN-CLIP ViT-B/16 | Download | 188M | ViT-B/16 | 86M | RoBERTa-wwm-Base | 102M | 224
CN-CLIP ViT-L/14 | Download | 406M | ViT-L/14 | 304M | RoBERTa-wwm-Base | 102M | 224
CN-CLIP ViT-L/14@336px | Download | 407M | ViT-L/14 | 304M | RoBERTa-wwm-Base | 102M | 336
CN-CLIP ViT-H/14 | Download | 958M | ViT-H/14 | 632M | RoBERTa-wwm-Large | 326M | 224
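These variants correspond to the model names exposed by the cn_clip API (listed again in the quick-start section below), so a specific variant can be selected simply by passing its name. The snippet below assumes the cn_clip package is already installed.
import torch
from cn_clip.clip import load_from_name

device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the ViT-H/14 variant from the table above; the pretrained weights are
# downloaded to download_root on first use.
model, preprocess = load_from_name("ViT-H-14", device=device, download_root="./")
model.eval()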
Experiment Results
Chinese-CLIP has been benchmarked on several tasks, demonstrating strong performance in experiments such as:
- MUGE Text-to-Image Retrieval: Shows significant improvements over baseline models in both zero-shot and fine-tuned settings.
- Flickr30K-CN Retrieval: Performs strongly on both text-to-image and image-to-text retrieval.
- COCO-CN Retrieval: Achieves high retrieval metrics in both zero-shot and fine-tuned setups.
- Zero-shot Image Classification: Performs well across a variety of datasets under the ELEVATER benchmark.
Getting Started
Installation Requirements
Before starting with Chinese-CLIP, ensure that your environment meets these requirements:
- Python >= 3.6.4
- PyTorch >= 1.8.0 (with torchvision >= 0.9.0)
- CUDA Version >= 10.2
Install the necessary third-party libraries with the following command:
pip install -r requirements.txt
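To quickly verify that the local environment meets these requirements, a short check such as the following can be run (a minimal sketch; the thresholds in the comments mirror the list above):
import torch
import torchvision

print("PyTorch:", torch.__version__)            # expected >= 1.8.0
print("torchvision:", torchvision.__version__)  # expected >= 0.9.0
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)      # expected >= 10.2 for GPU use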
Quick Start with API
To utilize the Chinese-CLIP API, follow this straightforward code example:
First, install the cn_clip package:
# Install via pip
pip install cn_clip
# Or install from source
cd Chinese-CLIP
pip install -e .
Once installed, you can easily call the API to extract feature vectors from images and calculate similarities to input texts:
import torch
from PIL import Image
import cn_clip.clip as clip
from cn_clip.clip import load_from_name, available_models
print("Available models:", available_models())
# Available models: ['ViT-B-16', 'ViT-L-14', 'ViT-L-14-336', 'ViT-H-14', 'RN50']
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = load_from_name("ViT-B-16", device=device, download_root='./')
model.eval()
image = preprocess(Image.open("examples/pokemon.jpeg")).unsqueeze(0).to(device)
text = clip.tokenize(["杰尼龟"]).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Calculate the cosine similarity between the image and text features
    similarity = torch.cosine_similarity(image_features, text_features)
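Building on the snippet above (reusing the loaded model, preprocess, image, and device), the same encoders can perform a simple zero-shot classification over several candidate captions. The candidate labels below are illustrative, and the fixed scaling factor of 100 stands in for the model's learned logit scale.
# Candidate Chinese captions (Pokémon names) used as zero-shot class labels
candidates = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]
text = clip.tokenize(candidates).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize so that dot products equal cosine similarities
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    # Softmax over scaled similarities yields one probability per candidate
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for label, prob in zip(candidates, probs[0].tolist()):
    print(f"{label}: {prob:.3f}")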
This guide provides a comprehensive introduction to the Chinese-CLIP project, from its inception and latest updates to practical steps for implementation.