Introducing SkyPaint-AI-Diffusion
SkyPaint-AI-Diffusion, developed by Singularity, is an ongoing bilingual text-to-image project. Given a prompt in Chinese, English, or a mix of both, the model generates an image in a modern art style.
Showcase of Capabilities
Here are some example prompts and the artistic images SkyPaint generates from them:
- Mechanical Dog
- Castle, Sea, Sunset, Miyazaki Animation
- How Many Falling Flowers
- Half Chicken Half Human, Strong
- Chicken, You're Too Beautiful
Try It Out
To experience SkyPaint firsthand, visit SkyPaint Online. Alternatively, scan the provided QR code with the WeChat app to try it in a mini-program.
Advantages of the Model
The SkyPaint model consists of two key components: a text encoder and a diffusion model. The project enhances these elements in two main steps:
- Text Encoder Optimization: Built upon OpenAI-CLIP, the text encoder is optimized for bilingual Chinese and English prompts (a sketch of how such an encoder slots into a stable_diffusion pipeline follows this list).
- Diffusion Model Enhancement: The diffusion model is fine-tuned to produce high-quality images with a distinct modern art style.
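To make the first point concrete, here is a minimal sketch of how a bilingual text encoder can be dropped into a stable_diffusion_1.x pipeline with diffusers. The base-model ID and the "path_to_our_model/..." paths are placeholders, and the actual packaging of the released weights may differ:

from diffusers import StableDiffusionPipeline
from transformers import BertTokenizer, CLIPTextModel

# Load a stock stable_diffusion_1.x pipeline and override only the text side:
# the UNet and VAE stay as they are, while the bilingual encoder and its
# tokenizer replace the original OpenAI-CLIP ones.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    text_encoder=CLIPTextModel.from_pretrained("path_to_our_model/text_encoder"),
    tokenizer=BertTokenizer.from_pretrained("path_to_our_model/tokenizer"),
)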
Features
- Supports text inputs in Chinese, English, and mixed languages.
- Generates high-quality images with modern artistic styles.
- Compatible with official stable_diffusion_1.x models and related fine-tuned models with English text prompts.
- Retains methods and practices typical of stable_diffusion prompts.
Example Usage
Here's how you can use the SkyPaint model in a Python script:
from diffusers import StableDiffusionPipeline

device = 'cuda'
# Load the SkyPaint weights (replace "path_to_our_model" with the actual model path).
pipe = StableDiffusionPipeline.from_pretrained("path_to_our_model").to(device)

prompts = [
    '机械狗',                   # Mechanical Dog
    '城堡 大海 夕阳 宫崎骏动画',  # Castle, Sea, Sunset, Miyazaki Animation
    '花落知多少',                # How Many Falling Flowers
    '鸡你太美',                  # Chicken, You're Too Beautiful
]

for prompt in prompts:
    # Prepend the 'sai-v1 art' style tag used during training.
    prompt = 'sai-v1 art, ' + prompt
    image = pipe(prompt).images[0]
    image.save("%s.jpg" % prompt)
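Note that each prompt is prefixed with the 'sai-v1 art' tag; as described in the Diffusion Model Explanation section below, this tag was attached during training so that it steers generations toward the intended style and quality. Each result is saved as a JPEG named after the full prompt.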
SkyCLIP Model Overview
SkyCLIP is a bilingual CLIP model trained with a method that significantly lowers the data and compute requirements, making it easy for the open-source community to reproduce or fine-tune. The method adjusts only OpenAI-CLIP's text encoder, so the resulting model can still be paired with the original image encoder for text-image retrieval.
SkyCLIP Training Data
SkyCLIP's training sources include:
- Parallel corpora from machine translation tasks
- United Nations bilingual corpora
- Portions of LAION and Wukong datasets
- AI-Challenger translation datasets
- Chinese literature
- Commonly used prompt combinations from prompt books
SkyCLIP Training Approach
The method uses OpenAI-CLIP's text_encoder as the teacher model, with its parameters frozen. The student model is a multilingual BERT of comparable size. The student's text features are trained to approximate the teacher's using a combination of loss functions, which brings the Chinese and English encodings into closer alignment. A rough sketch of this distillation setup is given below.
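The exact student checkpoint, loss functions, and data pairing are not spelled out above, so the following is only an illustrative sketch of such a distillation setup: a frozen OpenAI-CLIP text encoder as the teacher, a multilingual BERT student (the specific checkpoint here is an assumption), and a feature-matching loss on parallel English/Chinese pairs.

import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPTokenizer, BertModel, BertTokenizer

# Teacher: the OpenAI-CLIP text encoder, kept frozen throughout training.
teacher = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
teacher_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
for p in teacher.parameters():
    p.requires_grad_(False)

# Student: a multilingual BERT of comparable size (checkpoint chosen for illustration only),
# plus a projection into the teacher's embedding dimension.
student = BertModel.from_pretrained("bert-base-multilingual-cased")
student_tok = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
proj = torch.nn.Linear(student.config.hidden_size, teacher.config.projection_dim)

optimizer = torch.optim.AdamW(list(student.parameters()) + list(proj.parameters()), lr=1e-5)

def distill_step(en_texts, zh_texts):
    # Teacher features come from the English side of each parallel pair.
    with torch.no_grad():
        t_inputs = teacher_tok(en_texts, padding=True, truncation=True, return_tensors="pt")
        t_feat = teacher.get_text_features(**t_inputs)
    # Student features come from the Chinese side (English can be mixed in as well).
    s_inputs = student_tok(zh_texts, padding=True, truncation=True, return_tensors="pt")
    s_feat = proj(student(**s_inputs).pooler_output)
    # Combine an L2 term and a cosine term so the student's Chinese/English encodings
    # move toward the teacher's embedding space.
    loss = F.mse_loss(s_feat, t_feat) + (1.0 - F.cosine_similarity(s_feat, t_feat)).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()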
SkyCLIP Evaluation
SkyCLIP's zero-shot performance on Flickr30K-CN was evaluated against other multilingual models:
Flickr30K-CN Retrieval:
Model | Text-to-Image R@1 | R@5 | R@10 | Image-to-Text R@1 | R@5 | R@10 | MR
---|---|---|---|---|---|---|---
Taiyi-326M | 53.8 | 79.9 | 86.6 | 64.0 | 90.4 | 96.1 | 78.47
AltCLIP | 50.7 | 75.4 | 83.1 | 73.4 | 92.8 | 96.9 | 78.72
Wukong | 51.9 | 78.6 | 85.9 | 75.0 | 94.4 | 97.7 | 80.57
R2D2 | 42.6 | 69.5 | 78.6 | 63.0 | 90.1 | 96.4 | 73.37
CN-CLIP | 68.1 | 89.7 | 94.5 | 80.2 | 96.6 | 98.2 | 87.87
SkyCLIP | 58.8 | 82.6 | 89.6 | 78.8 | 96.1 | 98.3 | 84.04
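MR here is the mean recall over the six retrieval metrics (R@1/5/10 in both directions); for SkyCLIP, for example, (58.8 + 82.6 + 89.6 + 78.8 + 96.1 + 98.3) / 6 ≈ 84.04.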
Calculating Image-Text Similarity with SkyCLIP
Here's an example Python code to calculate the similarity between images and text using SkyCLIP:
from PIL import Image
import requests
import torch
import numpy as np
from transformers import BertTokenizer
from transformers import CLIPProcessor, CLIPModel, CLIPTextModel

query_texts = ['一个人', '一辆汽车', '两个男人', '两个女人']  # "a person", "a car", "two men", "two women" -- replace as desired

# SkyCLIP text side: a BERT tokenizer plus the distilled text encoder.
text_tokenizer = BertTokenizer.from_pretrained("./tokenizer")
text_encoder = CLIPTextModel.from_pretrained("./text_encoder").eval()
text = text_tokenizer(query_texts, return_tensors='pt', padding=True)['input_ids']

# Image side: the original OpenAI-CLIP image encoder and text projection.
url = "http://images.cocodataset.org/val2017/000000040083.jpg"
clip_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
clip_text_proj = clip_model.text_projection
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
image = processor(images=Image.open(requests.get(url, stream=True).raw), return_tensors="pt")

with torch.no_grad():
    image_features = clip_model.get_image_features(**image)
    text_features = text_encoder(text)[0]
    # The [SEP] token plays the role of OpenAI-CLIP's end-of-text token:
    # take the hidden state at each sequence's [SEP] position as the sentence feature.
    sep_index = torch.nonzero(text == text_tokenizer.sep_token_id)
    text_features = text_features[torch.arange(text.shape[0]), sep_index[:, 1]]
    # Map into CLIP's joint embedding space via the original text projection.
    text_features = clip_text_proj(text_features)
    # Normalize and compute cosine similarities scaled by CLIP's learned temperature.
    image_features = image_features / image_features.norm(dim=1, keepdim=True)
    text_features = text_features / text_features.norm(dim=1, keepdim=True)
    logit_scale = clip_model.logit_scale.exp()
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(np.around(probs, 3))
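Running the script prints a single row of four probabilities, one per candidate text; the caption that best matches the downloaded image should receive the highest score.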
Diffusion Model Explanation
SkyPaint's training data is drawn from filtered LAION datasets. A 'sai-v1 art' tag is attached to the training prompts so the model learns the desired style and quality more efficiently. The diffusion model is fine-tuned from the stable-diffusion-v1-5 base model on 16 A100 GPUs for roughly 50 hours, and continuous updates are planned to further stabilize and enhance it.
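Since the model also accepts English and mixed-language input, the same tag can be prepended there in the same way. The snippet below is only an illustration that reuses the pipe object from the Example Usage section; the prompt text is made up here, and the negative prompt is ordinary stable_diffusion practice rather than anything SkyPaint-specific:

image = pipe(
    'sai-v1 art, a castle by the sea at sunset, Miyazaki animation style',
    negative_prompt='lowres, blurry, bad anatomy',  # standard stable_diffusion-style negative prompt
    num_inference_steps=50,
).images[0]
image.save('castle_sunset.jpg')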
License
SkyPaint is available under the CreativeML Open RAIL-M License.
Join the Developer Community
Join the developer group by scanning the QR code with WeChat.
If you are interested, don't forget to star this project!