Introducing SkyPaint-AI-Diffusion
SkyPaint-AI-Diffusion, developed by Singularity, is an ongoing bilingual text-to-image project. Given a prompt in Chinese, English, or a mix of both, the model generates an image in a modern art style.
Showcase of Capabilities
Here are some example prompts and the artistic images SkyPaint generates from them:
- Mechanical Dog
- Castle, Sea, Sunset, Miyazaki Animation
- How Many Falling Flowers
- Half Chicken Half Human, Strong
- Chicken, You're Too Beautiful
Try It Out
To experience SkyPaint firsthand, visit SkyPaint Online. Alternatively, scan the provided QR code with the WeChat app to try it in a mini-program.
Advantages of the Model
The SkyPaint model consists of two key components: a text encoder and a diffusion model. The project enhances these elements in two main steps:
- Text Encoder Optimization: Built upon OpenAI-CLIP, the text encoder is optimized for bilingual Chinese and English prompts (a sketch of how such an encoder slots into a stable_diffusion pipeline follows this list).
- Diffusion Model Enhancement: The diffusion model is fine-tuned to produce high-quality images with a distinct modern art style.
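To make the first point concrete, here is a minimal sketch of how a bilingual text encoder can be dropped into a stable_diffusion_1.x pipeline with diffusers. The base-model ID and the "path_to_our_model/..." paths are placeholders, and the actual packaging of the released weights may differ:

from diffusers import StableDiffusionPipeline
from transformers import BertTokenizer, CLIPTextModel

# Load a stock stable_diffusion_1.x pipeline and override only the text side:
# the UNet and VAE stay as they are, while the bilingual encoder and its
# tokenizer replace the original OpenAI-CLIP ones.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    text_encoder=CLIPTextModel.from_pretrained("path_to_our_model/text_encoder"),
    tokenizer=BertTokenizer.from_pretrained("path_to_our_model/tokenizer"),
)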
Features
- Supports text inputs in Chinese, English, and mixed languages.
- Generates high-quality images with modern artistic styles.
- Compatible with official stable_diffusion_1.x models and related fine-tuned models with English text prompts.
- Retains methods and practices typical of stable_diffusion prompts.
Example Usage
Here's how you can use the SkyPaint model in a Python script:
from diffusers import StableDiffusionPipeline

device = 'cuda'
# Load the SkyPaint weights (replace "path_to_our_model" with the actual model path).
pipe = StableDiffusionPipeline.from_pretrained("path_to_our_model").to(device)

prompts = [
    '机械狗',                   # Mechanical Dog
    '城堡 大海 夕阳 宫崎骏动画',  # Castle, Sea, Sunset, Miyazaki Animation
    '花落知多少',                # How Many Falling Flowers
    '鸡你太美',                  # Chicken, You're Too Beautiful
]

for prompt in prompts:
    # Prepend the 'sai-v1 art' style tag used during training.
    prompt = 'sai-v1 art, ' + prompt
    image = pipe(prompt).images[0]
    image.save("%s.jpg" % prompt)
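Note that each prompt is prefixed with the 'sai-v1 art' tag; as described in the Diffusion Model Explanation section below, this tag was attached during training so that it steers generations toward the intended style and quality. Each result is saved as a JPEG named after the full prompt.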
SkyCLIP Model Overview
SkyCLIP is a bilingual CLIP model trained with a method that significantly lowers the data and compute requirements, making it easy for the open-source community to reproduce or fine-tune. The method adjusts only OpenAI-CLIP's text encoder, so the resulting model can still be paired with the original image encoder for text-image retrieval.
SkyCLIP Training Data
SkyCLIP's training sources include:
- Parallel corpora from machine translation tasks
- United Nations bilingual corpora
- Portions of LAION and Wukong datasets
- AI-Challenger translation datasets
- Chinese literature
- Commonly used prompt combinations from prompt books
SkyCLIP Training Approach
The method uses OpenAI-CLIP's text_encoder as the teacher model, with its parameters frozen. The student model is a multilingual BERT of comparable size. The student's text features are trained to approximate the teacher's using a combination of loss functions, which brings the Chinese and English encodings into closer alignment. A rough sketch of this distillation setup is given below.
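The exact student checkpoint, loss functions, and data pairing are not spelled out above, so the following is only an illustrative sketch of such a distillation setup: a frozen OpenAI-CLIP text encoder as the teacher, a multilingual BERT student (the specific checkpoint here is an assumption), and a feature-matching loss on parallel English/Chinese pairs.

import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPTokenizer, BertModel, BertTokenizer

# Teacher: the OpenAI-CLIP text encoder, kept frozen throughout training.
teacher = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
teacher_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
for p in teacher.parameters():
    p.requires_grad_(False)

# Student: a multilingual BERT of comparable size (checkpoint chosen for illustration only),
# plus a projection into the teacher's embedding dimension.
student = BertModel.from_pretrained("bert-base-multilingual-cased")
student_tok = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
proj = torch.nn.Linear(student.config.hidden_size, teacher.config.projection_dim)

optimizer = torch.optim.AdamW(list(student.parameters()) + list(proj.parameters()), lr=1e-5)

def distill_step(en_texts, zh_texts):
    # Teacher features come from the English side of each parallel pair.
    with torch.no_grad():
        t_inputs = teacher_tok(en_texts, padding=True, truncation=True, return_tensors="pt")
        t_feat = teacher.get_text_features(**t_inputs)
    # Student features come from the Chinese side (English can be mixed in as well).
    s_inputs = student_tok(zh_texts, padding=True, truncation=True, return_tensors="pt")
    s_feat = proj(student(**s_inputs).pooler_output)
    # Combine an L2 term and a cosine term so the student's Chinese/English encodings
    # move toward the teacher's embedding space.
    loss = F.mse_loss(s_feat, t_feat) + (1.0 - F.cosine_similarity(s_feat, t_feat)).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()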
SkyCLIP Evaluation
SkyCLIP's zero-shot performance on Flickr30K-CN was evaluated against other multilingual models:
Flickr30K-CN Retrieval:
Model | Text-to-Image R@1 | R@5 | R@10 | Image-to-Text R@1 | R@5 | R@10 | MR
---|---|---|---|---|---|---|---
Taiyi-326M | 53.8 | 79.9 | 86.6 | 64.0 | 90.4 | 96.1 | 78.47
AltCLIP | 50.7 | 75.4 | 83.1 | 73.4 | 92.8 | 96.9 | 78.72
Wukong | 51.9 | 78.6 | 85.9 | 75.0 | 94.4 | 97.7 | 80.57
R2D2 | 42.6 | 69.5 | 78.6 | 63.0 | 90.1 | 96.4 | 73.37
CN-CLIP | 68.1 | 89.7 | 94.5 | 80.2 | 96.6 | 98.2 | 87.87
SkyCLIP | 58.8 | 82.6 | 89.6 | 78.8 | 96.1 | 98.3 | 84.04
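MR here is the mean recall over the six retrieval metrics (R@1/5/10 in both directions); for SkyCLIP, for example, (58.8 + 82.6 + 89.6 + 78.8 + 96.1 + 98.3) / 6 ≈ 84.04.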
Calculating Image-Text Similarity with SkyCLIP
Here's an example Python code to calculate the similarity between images and text using SkyCLIP:
from PIL import Image
import requests
import torch
import numpy as np
from transformers import BertTokenizer
from transformers import CLIPProcessor, CLIPModel, CLIPTextModel

query_texts = ['一个人', '一辆汽车', '两个男人', '两个女人']  # "a person", "a car", "two men", "two women" -- replace as desired

# SkyCLIP text side: a BERT tokenizer plus the distilled text encoder.
text_tokenizer = BertTokenizer.from_pretrained("./tokenizer")
text_encoder = CLIPTextModel.from_pretrained("./text_encoder").eval()
text = text_tokenizer(query_texts, return_tensors='pt', padding=True)['input_ids']

# Image side: the original OpenAI-CLIP image encoder and text projection.
url = "http://images.cocodataset.org/val2017/000000040083.jpg"
clip_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
clip_text_proj = clip_model.text_projection
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
image = processor(images=Image.open(requests.get(url, stream=True).raw), return_tensors="pt")

with torch.no_grad():
    image_features = clip_model.get_image_features(**image)
    text_features = text_encoder(text)[0]
    # The [SEP] token plays the role of OpenAI-CLIP's end-of-text token:
    # take the hidden state at each sequence's [SEP] position as the sentence feature.
    sep_index = torch.nonzero(text == text_tokenizer.sep_token_id)
    text_features = text_features[torch.arange(text.shape[0]), sep_index[:, 1]]
    # Map into CLIP's joint embedding space via the original text projection.
    text_features = clip_text_proj(text_features)
    # Normalize and compute cosine similarities scaled by CLIP's learned temperature.
    image_features = image_features / image_features.norm(dim=1, keepdim=True)
    text_features = text_features / text_features.norm(dim=1, keepdim=True)
    logit_scale = clip_model.logit_scale.exp()
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(np.around(probs, 3))
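Running the script prints a single row of four probabilities, one per candidate text; the caption that best matches the downloaded image should receive the highest score.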
Diffusion Model Explanation
SkyPaint's training data is drawn from filtered LAION datasets. A 'sai-v1 art' tag is attached to the training prompts so the model learns the desired style and quality more efficiently. The diffusion model is fine-tuned from the stable-diffusion-v1-5 base model on 16 A100 GPUs for roughly 50 hours, and continuous updates are planned to further stabilize and enhance it.
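Since the model also accepts English and mixed-language input, the same tag can be prepended there in the same way. The snippet below is only an illustration that reuses the pipe object from the Example Usage section; the prompt text is made up here, and the negative prompt is ordinary stable_diffusion practice rather than anything SkyPaint-specific:

image = pipe(
    'sai-v1 art, a castle by the sea at sunset, Miyazaki animation style',
    negative_prompt='lowres, blurry, bad anatomy',  # standard stable_diffusion-style negative prompt
    num_inference_steps=50,
).images[0]
image.save('castle_sunset.jpg')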
License
SkyPaint is available under the CreativeML Open RAIL-M License.
Join the Developer Community
Join the developer group by scanning the QR code with WeChat.
If you are interested, don't forget to star this project!