Introduction to ViTamin: A Frontier in Vision-Language Integration
Support and Accessibility
ViTamin is officially supported by popular platforms such as timm and OpenCLIP, with substantial contributions from maintainers such as @rwightman. HuggingFace hosts a dedicated collection of ViTamin model cards, making the models readily accessible to a global audience. Together, these resources make it straightforward to adopt ViTamin across a wide range of tasks.
Simple Integration
Integrating ViTamin into a project is straightforward, requiring just a single call to timm.create_model:

```python
import timm

model = timm.create_model('vitamin_xlarge_384')
```
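To confirm the integration works before downloading any weights, the returned module can be exercised like any other timm image model. The sketch below is illustrative and assumes a recent timm release that registers the vitamin_* architectures; it builds the backbone without pretrained weights and forwards a dummy batch at the model's native resolution:

```python
import timm
import torch

# Build the ViTamin-XL backbone from timm's registry (pretrained=False: no weights are downloaded).
model = timm.create_model('vitamin_xlarge_384', pretrained=False).eval()

# Resolve the preprocessing configuration the model expects (input size, mean/std).
data_config = timm.data.resolve_model_data_config(model)
print(data_config['input_size'])  # e.g. (3, 384, 384)

# Forward a dummy batch to confirm the model runs end to end.
dummy = torch.randn(1, *data_config['input_size'])
with torch.no_grad():
    output = model(dummy)
print(output.shape)
```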
Performance Highlights
ViTamin-XL sets a high bar with an efficient design: only 436M parameters, yet 82.9% zero-shot ImageNet accuracy when trained on the publicly available DataComp-1B dataset. For open-vocabulary segmentation, ViTamin-L sets new standards across seven benchmarks, surpassing existing backbones. It also serves as a strong vision encoder for large multi-modal models, improving their vision-language understanding.
Getting Started with ViTamin
Here's a look at the various tasks supported by ViTamin:
- ViTamin Pre-training: Includes CLIP pre-training/fine-tuning pipelines and zero-shot evaluation, as detailed in the ViTamin README.
- Open-vocabulary Detection and Segmentation: Offers techniques for open-vocabulary tasks, accessible through dedicated repositories like ViTamin for Open-vocab Detection and ViTamin for Open-vocab Segmentation.
- Large Multi-Modal Models: Supports large-scale vision and language models, as explored through ViTamin for Large Multi-Modal Models.
Using ViTamin with Hugging Face
For practical implementation, ViTamin can be used via Hugging Face with the jienengchen/ViTamin-XL-384px model. Below is a code snippet that sets up the model and runs a basic zero-shot classification with PyTorch:
```python
import torch
import open_clip
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the ViTamin-XL image-text model from the Hugging Face Hub.
model = AutoModel.from_pretrained(
    'jienengchen/ViTamin-XL-384px',
    trust_remote_code=True).to(device).eval()

# Preprocess the input image with the matching CLIP image processor.
image = Image.open('./image.png').convert('RGB')
image_processor = CLIPImageProcessor.from_pretrained('jienengchen/ViTamin-XL-384px')
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).to(device)

# Tokenize the candidate text prompts with the OpenCLIP tokenizer.
tokenizer = open_clip.get_tokenizer('hf-hub:laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K')
text = tokenizer(["a photo of vitamin", "a dog", "a cat"]).to(device)

# Encode image and text, then compute softmax-normalized similarity scores.
with torch.no_grad(), torch.cuda.amp.autocast():
    image_features, text_features, logit_scale = model(pixel_values, text)
    text_probs = (100.0 * image_features @ text_features.to(torch.float).T).softmax(dim=-1)

print("Label probs:", text_probs)
```
Achievements with CLIP Pre-training on DataComp-1B
In collaboration with the community, ViTamin provides 61 trained Vision-Language Models (VLMs): 48 benchmarked models plus 13 top-performers released on Hugging Face. The models are evaluated on zero-shot ImageNet classification and image-text retrieval across a range of datasets, demonstrating ViTamin's robustness in real-world scenarios.
Performance on Downstream Tasks
Open-Vocabulary Detection
ViTamin backbones transfer well to open-vocabulary object detection, evaluated on the OV-COCO and OV-LVIS benchmarks.
| Image Encoder | Detector | OV-COCO (novel AP50) | OV-LVIS (mask APr) |
|---|---|---|---|
| ViTamin-L | Sliding F-ViT | 37.5 | 35.6 |
Open-Vocabulary Segmentation
ViTamin-L paired with Sliding FC-CLIP delivers strong results across a broad suite of open-vocabulary segmentation benchmarks, spanning panoptic datasets (ADE, Cityscapes, Mapillary Vistas) and semantic segmentation datasets (A-150, A-847, PC-459, PC-59, PAS-21).
| Image Encoder | Segmentor | ADE | Cityscapes | MV | A-150 | A-847 | PC-459 | PC-59 | PAS-21 |
|---|---|---|---|---|---|---|---|---|---|
| ViTamin-L | Sliding FC-CLIP | 27.3 | 44.0 | 18.2 | 35.6 | 16.1 | 20.4 | 58.4 | 83.4 |
Large Multi-modal Models
Used as the vision encoder in large multi-modal models, ViTamin-L delivers strong results on benchmarks such as VQAv2, GQA, VizWiz, and ScienceQA (SQA).
| Image Encoder | Image Size | VQAv2 | GQA | VizWiz | SQA |
|---|---|---|---|---|---|
| ViTamin-L | 384 | 78.9 | 61.6 | 55.4 | 67.6 |
Conclusion
With a design focused on scalability and efficiency, ViTamin emerges as a strong vision backbone for modern vision-language challenges. Its ease of use, strong benchmark results, and broad applicability make it a notable step forward in vision-language integration.